Abstract

We apply the method of defensive forecasting, based on the use of 
game-theoretic supermartingales, to prediction with expert advice. In the 
traditional setting of a countable number of experts and a finite number 
of outcomes, the Defensive Forecasting Algorithm is very close to the well- 
known Aggregating Algorithm. Not only the performance guarantees but 
also the predictions are the same for these two methods of fundamentally 
different nature. We also discuss a new setting where the experts can give
advice conditional on the learner's future decision. Both algorithms
can be adapted to the new setting and give the same performance
guarantees as in the traditional setting. Finally, we outline an application of
defensive forecasting to a setting with several loss functions. 

1 Introduction 

The framework of prediction with expert advice was introduced in the late 
1980s. In contrast to statistical learning theory, the methods of prediction with 
expert advice do not require statistical assumptions about the source of data. 
The role of the assumptions is played by a "pool of experts": the forecaster,
called Learner, bases his predictions upon the predictions and performance of 
the experts. For details and references, see the monograph [6]. 

Many methods for prediction with expert advice are known. This paper deals 
with two of them: the Aggregating Algorithm [21] and defensive forecasting [35].
The Aggregating Algorithm (the AA for short) is a member of the family of
exponential-weights algorithms and implements a Bayesian-type aggregation;
various optimality properties of the AA have been established [35]. Defensive
forecasting is a recently developed technique that combines the ideas of
game-theoretic probability [21] with Levin and Gács's ideas of neutral measure [10, 16]
and Foster and Vohra's ideas of universal calibration [3].



The idea of defensive forecasting comes from an interpretation of probability 
with the help of perfect information games. The Learner develops his strategy 
modeling a game where a probability forecaster plays on the actual data against 
an imaginary opponent, Sceptic, that represents a law of probability. The cap- 
ital of Sceptic tends to infinity (or becomes large) if the players' moves lead to 
violation of this law. The capital of a strategy for Sceptic as a function of other 
players' moves is called a (game-theoretic) supermartingale. It is known (see
Lemma 4 in this paper) that for any supermartingale there is a forecasting
strategy that prevents this supermartingale from growing ("defending" against this
strategy of Sceptic), thereby forcing the corresponding law of probability. The
older versions of defensive forecasting (see, e.g., [26]) minimize Learner's actual
loss with the help of the following trick: a forecasting strategy is constructed so 
that the actual losses (Learner's and experts') are close to the (one-step-ahead 
conditional) expected losses; at each step Learner minimizes the expected loss 
(that is, the law of probability used in this case is the conjunction of several laws 
of large numbers). This paper gives a self-contained description of a different 
version of the defensive forecasting method. We use certain supermartingales 
and do not need to talk about the underlying laws of probability. 

Defensive forecasting, as well as the AA, can be used for competitive online 
prediction against "pools of experts" consisting of all functions from a large 
function class (see [27, 28]). However, the loss bounds proved so far are
generally incomparable: for large classes (such as many Sobolev spaces), defensive
forecasting is better, whereas for smaller classes (such as classes of analytical 
functions), the AA works better. Note that the optimality results for the AA 
are obtained for experts that are free agents, not functions from a given class; 
thus we need to evaluate the algorithms anew. This general task requires a 
deeper understanding of the properties of defensive forecasting. 

In this paper, the AA and defensive forecasting are discussed in the simple
case of a finite number of outcomes. Learner competes with a countable pool
of Experts $\Theta$. Experts and Learner give predictions and suffer some loss at
each step. A game is a specification of which predictions are admissible and what
loss a prediction incurs for each outcome. For every game, we are interested in
performance guarantees of the form

$$\forall \theta \in \Theta \;\; \forall N \quad L_N \le c L_N^\theta + a^\theta ,$$

where $L_N$ is the cumulative loss of Learner and $L_N^\theta$ is the cumulative loss of
Expert $\theta$ over the first $N$ steps, $c$ is some constant and $a^\theta$ depends on $\theta$ only.
Section 2 recalls the AA and its loss bound (Theorem 1) and introduces notation
used in the paper.

Section 3 presents the main results of the paper. Subsection 3.1 describes
the Defensive Forecasting Algorithm (DFA), which is based on the use of
game-theoretic supermartingales, and its loss bound (Theorem 5). It turns out that
if the AA and the DFA are both applicable to a game, they guarantee the same
loss bound. Subsections 3.3-3.6 discuss when the DFA and the AA are
applicable. Loosely speaking, if the DFA is applicable then the AA is applicable as well
(Theorem 9); and for games satisfying some additional assumptions, if the AA
is applicable then the DFA is applicable (Theorems 12 and 21).
Subsection 3.7 gives a criterion of the AA realizability in terms of supermartingales
(Theorem 22) using a rather awkward variant of the DFA. The construction of
the supermartingales used in this paper involves a parameterization of the game
with the help of a proper loss function. Proper loss functions play an
important role in Bayesian statistics, and their meaning in our context is discussed in
Subsections 3.4 and 3.6.

The rest of the paper is devoted to modifications of the standard setting.
Subsection 3.8 applies the DFA in an extended setting where the outcomes form
a finite-dimensional simplex. Section 4 introduces a new setting for
prediction with expert advice, where the experts are allowed to "second-guess", that
is, to give "conditional" predictions that are functions of Learner's future
decision (cf. the notion of internal regret [9]). If the dependence is regular
enough (namely, continuous), the DFA works in the new setting virtually
without changes (Theorem 25). The AA with some modification based on the fixed
point theorem can be applied in the new setting too (Theorem 29). Section 5
briefly outlines one more application of the DFA: a setting with several loss
functions.

Some results of the paper appeared in [29] and in ALT'08 proceedings @]. 

2 Games of Prediction and the Aggregating Algorithm

We begin with formulating the setting of prediction with expert advice. A game
of prediction consists of three components: a non-empty set $\Omega$ of possible
outcomes, a non-empty set $\Gamma$ of possible decisions, and a function
$\lambda: \Gamma \times \Omega \to [0,\infty]$ called the loss function. In this paper we assume that
the set $\Omega$ is finite.

The set $\Lambda = \{g \in [0,\infty]^\Omega \mid \exists\gamma\in\Gamma\ \forall\omega\in\Omega\ \ g(\omega) = \lambda(\gamma,\omega)\}$ is called the
set of predictions of the game. In this paper, we will identify each decision
$\gamma \in \Gamma$ with the function $\omega \mapsto \lambda(\gamma,\omega)$ (and also with a point in a $|\Omega|$-dimensional
Euclidean space with pointwise operations). A loss function can be considered
as a parameterization of $\Lambda$ by elements of $\Gamma$. To study the properties of a game,
we do not need to know the decision set $\Gamma$ and the loss function; we can forget
about them and consider the prediction set $\Lambda$ only. From now on, a game will
be specified by a pair $(\Omega, \Lambda)$, where $\Lambda \subseteq [0,\infty]^\Omega$. We will use the letter $\gamma$ (as
well as $g$) with indices to denote elements of $[0,\infty]^\Omega$ (rather than decisions).

However, loss functions remain a convenient method to specify a game, and
we will use them in examples. An important technical tool will also be a kind
of canonical parameterization of $\Lambda$ given by the so-called proper loss functions.
Loss functions are also unavoidable in Section 5, where we consider games with
several simultaneous losses.

The game of prediction with expert advice is played by Learner, Experts,
and Reality; the set ("pool") of Experts is denoted by $\Theta$. We will assume that
$\Theta$ is (finite or) countable. There is no loss of generality in assuming that Reality
and all Experts are cooperative, since we are only interested in what can be
achieved by Learner alone; therefore, we essentially consider a two-player game.
The game is played according to Protocol 1.

The goal of Learner is to keep $L_n$ smaller or at least not much greater than
$L_n^\theta$, at each step $n$ and for all $\theta \in \Theta$.

Protocol 1 Prediction with Expert Advice

  $L_0 := 0$.
  $L_0^\theta := 0$, for all $\theta \in \Theta$.
  for $n = 1, 2, \ldots$ do
    All Experts $\theta \in \Theta$ announce $\gamma_n^\theta \in \Lambda$.
    Learner announces $\gamma_n \in \Lambda$.
    Reality announces $\omega_n \in \Omega$.
    $L_n := L_{n-1} + \gamma_n(\omega_n)$.
    $L_n^\theta := L_{n-1}^\theta + \gamma_n^\theta(\omega_n)$, for all $\theta \in \Theta$.
  end for

To analyze the game, we need some additional notation. A point $g \in [0,\infty]^\Omega$
is called a superprediction in the game $(\Omega, \Lambda)$ if there is $\gamma \in \Lambda$ such that
$\gamma(\omega) \le g(\omega)$ for all $\omega \in \Omega$. It is convenient to write the last condition as $\gamma \le g$.
In the sequel, we will use pointwise relations and operations for the elements of
$[0,\infty]^\Omega$ without special mentioning.

For a game $(\Omega, \Lambda)$, denote by $\Sigma_\Lambda$ the set of all superpredictions. Using
operations on sets, this definition can be written as

$$\Sigma_\Lambda = \Lambda + [0,\infty]^\Omega = \{\gamma + g \mid \gamma \in \Lambda,\ g \in [0,\infty]^\Omega\}.$$

The Aggregating Algorithm is a strategy for Learner. It has four parameters:
reals $c \ge 1$ and $\eta > 0$, a distribution $P_0$ on $\Theta$ (that is, $P_0(\theta) \in [0,1]$ for every
$\theta \in \Theta$ and $\sum_{\theta\in\Theta} P_0(\theta) = 1$), and a substitution function $\sigma: \Sigma_\Lambda \to \Lambda$ such that
$\sigma(g) \le g$ for any $g \in \Sigma_\Lambda$.

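At step $N$, the AA computes $g_N \in [0,\infty]^\Omega$ by the formula

$$g_N(\omega) = -\frac{c}{\eta} \ln \left( \sum_{\theta\in\Theta} \frac{P_{N-1}(\theta)}{\sum_{\theta'\in\Theta} P_{N-1}(\theta')} \exp\bigl(-\eta\gamma_N^\theta(\omega)\bigr) \right),$$

where

$$P_{N-1}(\theta) = P_0(\theta) \prod_{n=1}^{N-1} \exp\bigl(-\eta\gamma_n^\theta(\omega_n)\bigr)$$

is the (unnormalized posterior) distribution on $\Theta$. Then, $\gamma_N = \sigma(g_N)$ is announced as
Learner's prediction.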
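As an illustration only, the AA step can be sketched in Python for the binary logarithmic loss game with $c = 1$ and $\eta = 1$. The function names and the representation of a decision by a single number $p$ (the probability of outcome 1) are assumptions made for this sketch, not notation from the paper. In this special case $g_N$ is itself a prediction, so the substitution function can be taken to be trivial: the decision $p = \exp(-g_N(1))$ reproduces $g_N$ up to rounding.

```python
import math

def log_loss(p, outcome):
    """Logarithmic loss of decision p in [0, 1] on a binary outcome."""
    q = p if outcome == 1 else 1.0 - p
    return math.inf if q == 0.0 else -math.log(q)

def generalized_prediction(weights, expert_losses, c=1.0, eta=1.0):
    """g_N(omega) = -(c/eta) ln(sum_theta normalized weight * exp(-eta * loss))."""
    total = sum(weights)
    mix = sum(w / total * math.exp(-eta * loss)
              for w, loss in zip(weights, expert_losses))
    return math.inf if mix == 0.0 else -(c / eta) * math.log(mix)

def run_aa_log_loss(prior, expert_preds, outcomes):
    """AA with c = 1, eta = 1 on binary log loss; returns (L_N, [L_N^theta])."""
    weights = list(prior)                      # unnormalized P_{N-1}(theta)
    learner_loss, expert_loss = 0.0, [0.0] * len(prior)
    for preds, outcome in zip(expert_preds, outcomes):
        # Generalized prediction for outcome 1; for eta = 1 the decision
        # p = exp(-g_N(1)) has log loss equal to g_N on both outcomes.
        g1 = generalized_prediction(weights, [log_loss(p, 1) for p in preds])
        p_learner = math.exp(-g1)
        learner_loss += log_loss(p_learner, outcome)
        for i, p in enumerate(preds):
            loss = log_loss(p, outcome)
            expert_loss[i] += loss
            weights[i] *= math.exp(-loss)      # posterior update with eta = 1
    return learner_loss, expert_loss
```

For other games or other values of $\eta$ the substitution step is not this simple, so the sketch should not be read as a general implementation.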

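The step $N$ of the AA can be performed if and only if $g_N$ is a superprediction
($g_N \in \Sigma_\Lambda$), that is, if

$$\exists \gamma_N \in \Lambda \;\; \forall\omega \quad \gamma_N(\omega) \le -\frac{c}{\eta} \ln \left( \sum_{\theta\in\Theta} \frac{P_{N-1}(\theta)}{\sum_{\theta'\in\Theta} P_{N-1}(\theta')} \exp\bigl(-\eta\gamma_N^\theta(\omega)\bigr) \right). \tag{1}$$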

We say that the AA is $(c,\eta)$-realizable (for the game $(\Omega,\Lambda)$) if condition (1)
is true regardless of $\Theta$, $N$, $\gamma_N^\theta \in \Lambda$ and $P_{N-1}$ (that is, regardless of $P_0$, the
history of the previous moves, and the opponents' moves at the last step). This
requirement can be restated in several equivalent forms: for any finite set $G \subseteq \Lambda$
and for any distribution $p$ on $G$, it holds that

$$\exists\gamma\in\Lambda \quad \gamma \le -\frac{c}{\eta}\ln \sum_{g\in G} p(g)\exp(-\eta g); \tag{2}$$
or equivalently, for any finite $G \subseteq \Sigma_\Lambda$ and any distribution $p$ on $G$, it holds that

$$\exists\gamma\in\Sigma_\Lambda \quad \gamma \le -\frac{c}{\eta}\ln \sum_{g\in G} p(g)\exp(-\eta g); \tag{3}$$

equivalently, in the last formula $\le$ can be replaced by $=$. Indeed, the
condition (1) implies (2) since $\gamma_N^\theta$ and $P_0$ are arbitrary; $G \subseteq \Lambda$ can be replaced by
$G \subseteq \Sigma_\Lambda$ since the right-hand side of (2) increases when elements of $G$ increase;
by definition, (2) means that its right-hand side belongs to $\Sigma_\Lambda$, and we get (3)
with $=$ instead of $\le$. Clearly, (1) follows from (3), if we allow countably infinite
$G$ as well (then we can take $\{\gamma_N^\theta \mid \theta \in \Theta\}$ for $G$), which is possible due to the
following property of convex sets.

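For a given $\eta$, the exp-convex hull of $\Sigma_\Lambda$ is the set $\Sigma^\eta_\Lambda \supseteq \Sigma_\Lambda$ that consists of
all points in $[0,\infty]^\Omega$ of the form

$$-\frac{1}{\eta}\ln \sum_{g\in G} p(g)\exp(-\eta g), \tag{4}$$

where $G$ is a finite subset of $\Sigma_\Lambda$ and $p$ is a distribution on $G$. Actually,
$\exp(-\eta\Sigma^\eta_\Lambda)$ is the convex hull of $\exp(-\eta\Sigma_\Lambda)$. As known from convex analysis,
we get the same definition if we allow infinite $G$ (see e.g. [21 Theorem 2.4.1]).
With this notation, the condition (3) says that $\Sigma_\Lambda \supseteq c\Sigma^\eta_\Lambda$.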

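Let us state some properties of the set $\Sigma^\eta_\Lambda$. First, $\Sigma^\eta_\Lambda = \Sigma^\eta_\Lambda + [0,\infty]^\Omega$, that
is, if $\Sigma^\eta_\Lambda$ is a prediction set then its superprediction set is $\Sigma^\eta_\Lambda$ itself. (Indeed,
if a point $g_0$ of the form (4) belongs to $\Sigma^\eta_\Lambda$ as a combination of $g_i \in G \subseteq \Sigma_\Lambda$
then, for any $g \in [0,\infty]^\Omega$, the point $g_0 + g$ belongs to $\Sigma^\eta_\Lambda$ as the combination
of $g_i + g$.) The set $\exp(-\eta\Sigma^\eta_\Lambda)$ is convex (clearly, the points of the form (4)
belong to $\Sigma^\eta_\Lambda$ also if we allow $G \subseteq \Sigma^\eta_\Lambda$). The convexity of the exponent implies
that the set $\Sigma^\eta_\Lambda$ is convex as well: if $g_1, g_2 \in \Sigma^\eta_\Lambda$ then
$\alpha g_1 + (1-\alpha) g_2 \ge -\frac{1}{\eta}\ln\bigl(\alpha\exp(-\eta g_1) + (1-\alpha)\exp(-\eta g_2)\bigr)$
and hence $\alpha g_1 + (1-\alpha) g_2 \in \Sigma^\eta_\Lambda$ too.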

The game $(\Omega,\Lambda)$ is called $\eta$-mixable if the AA is $(1,\eta)$-realizable, that is, if
$\Sigma_\Lambda = \Sigma^\eta_\Lambda$. The game is mixable if it is $\eta$-mixable for some $\eta > 0$. The mixable
games are of special interest. In a sense, the AA works with mixable games
only, and to any non-mixable game $(\Omega,\Lambda)$ the AA assigns the $\eta$-mixable game
$(\Omega,\Sigma^\eta_\Lambda)$ and then simply transfers the loss bound (at the price of a constant
factor). Standard examples of mixable games are the square loss game [25,
Example 4], which is $\eta$-mixable for $\eta \in (0,2]$, and the logarithmic loss game [25,
Example 5], which is $\eta$-mixable for $\eta \in (0,1]$; see Subsection 3.2. A standard
example of a non-mixable game is the absolute loss game [25, Example 3] with
the loss function $\lambda(p,\omega) = |p-\omega|$, $p \in [0,1]$, $\omega \in \{0,1\}$ (its prediction set $\Lambda$
is $\{(x,y) \in [0,1]^2 \mid x+y = 1\}$); for the absolute loss game, the AA is
$(c,\eta)$-realizable for $\eta > 0$ and $c \ge \eta/(2\ln(2/(1+e^{-\eta})))$.
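To get a feeling for the constant in the absolute loss example, the threshold $\eta/(2\ln(2/(1+e^{-\eta})))$ can be evaluated directly. The short Python sketch below only evaluates this formula; the function name is a hypothetical choice for illustration.

```python
import math

def absolute_loss_constant(eta):
    """Threshold eta / (2 ln(2 / (1 + e^{-eta}))) for the absolute loss game."""
    return eta / (2.0 * math.log(2.0 / (1.0 + math.exp(-eta))))

if __name__ == "__main__":
    # The constant exceeds 1 for every eta > 0 and approaches 1 as eta -> 0.
    for eta in (0.01, 0.1, 0.5, 1.0, 2.0):
        print(eta, absolute_loss_constant(eta))
```

A small $\eta$ brings the multiplicative constant close to 1, which is the usual trade-off against additive terms that scale like $1/\eta$.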

A detailed survey of the AA, its properties, attainable bounds and
realizability conditions for a number of games can be found in [25]. Here we reproduce
the proof of the main loss bound in the form that motivates our further study.

Theorem 1 ([24]). If the AA is $(c,\eta)$-realizable then the AA with parameters
$c$, $\eta$, $P_0$, and $\sigma$ guarantees that, at each step $N$ and for all experts $\theta$, it holds

$$L_N \le c L_N^\theta + \frac{c}{\eta}\ln\frac{1}{P_0(\theta)}.$$
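Proof. We need to deduce the performance bound from the condition (1). To
this end, we will rewrite (1) and get a semi-invariant of the AA, that is, a value that
does not grow. Indeed, the inequality (1) is equivalent to

$$\sum_{\theta\in\Theta} P_{N-1}(\theta) \ge \sum_{\theta\in\Theta} P_{N-1}(\theta) \exp\bigl(-\eta\gamma_N^\theta(\omega)\bigr) \exp\Bigl(\frac{\eta}{c}\gamma_N(\omega)\Bigr).$$

Multiplying both sides by $\prod_{n=1}^{N-1} \exp\bigl(\frac{\eta}{c}\gamma_n(\omega_n)\bigr)$ (which is independent of $\theta$ and
hence can be placed under the sum), and expanding $P_{N-1}$, we get

$$\sum_{\theta\in\Theta} P_0(\theta) \prod_{n=1}^{N-1} \exp\bigl(-\eta\gamma_n^\theta(\omega_n)\bigr) \prod_{n=1}^{N-1} \exp\Bigl(\frac{\eta}{c}\gamma_n(\omega_n)\Bigr)
\ge \sum_{\theta\in\Theta} P_0(\theta) \prod_{n=1}^{N-1} \exp\bigl(-\eta\gamma_n^\theta(\omega_n)\bigr) \prod_{n=1}^{N-1} \exp\Bigl(\frac{\eta}{c}\gamma_n(\omega_n)\Bigr)
\times \exp\bigl(-\eta\gamma_N^\theta(\omega)\bigr) \exp\Bigl(\frac{\eta}{c}\gamma_N(\omega)\Bigr),$$

that is,

$$\sum_{\theta\in\Theta} P_0(\theta) Q_{N-1}(\theta) \ge \sum_{\theta\in\Theta} P_0(\theta) Q_{N-1}(\theta) \exp\Bigl(\eta\Bigl(\frac{\gamma_N(\omega)}{c} - \gamma_N^\theta(\omega)\Bigr)\Bigr),$$

where $Q_{N-1}$ is defined by the formula:

$$Q_{N-1}(\theta) = \exp\Bigl(\eta \sum_{n=1}^{N-1} \Bigl(\frac{\gamma_n(\omega_n)}{c} - \gamma_n^\theta(\omega_n)\Bigr)\Bigr).$$

That is, the condition (1) is equivalent to

$$\exists\gamma_N\in\Lambda \;\; \forall\omega \quad \sum_{\theta\in\Theta} P_0(\theta)\tilde{Q}_N(\theta) \le \sum_{\theta\in\Theta} P_0(\theta) Q_{N-1}(\theta), \tag{5}$$

where $\tilde{Q}_N$ is the result of substituting $\omega$ for $\omega_N$ in $Q_N$.

In other words, the AA (when it is $(c,\eta)$-realizable) guarantees that
after each step $n$ the value $\sum_{\theta\in\Theta} P_0(\theta) Q_n(\theta)$ does not increase whatever $\omega_n$
is chosen by Reality. Since $\sum_{\theta\in\Theta} P_0(\theta) Q_0(\theta) = \sum_{\theta\in\Theta} P_0(\theta) = 1$, we get
$\sum_{\theta\in\Theta} P_0(\theta) Q_N(\theta) \le 1$ and $Q_N(\theta) \le 1/P_0(\theta)$ for each step $N$. To complete
the proof it remains to note that

$$Q_N(\theta) = \exp\Bigl(\eta\Bigl(\frac{L_N}{c} - L_N^\theta\Bigr)\Bigr).$$

□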

For $c = 1$, the value $\frac{1}{\eta}\ln\bigl(\sum_{\theta} P_0(\theta) Q_N(\theta)\bigr)$ is known as the exponential
potential (see [6, Sections 3.3, 3.5]) and plays an important role in the analysis
of weighted average algorithms. In the next section we show that the reason why
condition (5) can be satisfied is essentially that the function $\sum_{\theta} P_0(\theta) Q_N(\theta)$ is
a supermartingale.
3 Supermartingales and the AA



Let $\mathcal{P}(\Omega)$ be the set of all distributions on $\Omega$. Note that since $\Omega$ is finite we
can identify $\mathcal{P}(\Omega)$ with a $(|\Omega|-1)$-dimensional simplex in Euclidean space $\mathbb{R}^{|\Omega|}$
equipped with the standard distance and topology. Let $E$ be any non-empty set.
A real-valued function $S$ defined on $(E \times \mathcal{P}(\Omega) \times \Omega)^*$ is called a (game-theoretic)
supermartingale if for any $N$, for any $e_1,\ldots,e_N \in E$, for any $\pi_1,\ldots,\pi_N \in \mathcal{P}(\Omega)$,
for any $\omega_1,\ldots,\omega_{N-1} \in \Omega$, it holds that

$$\sum_{\omega\in\Omega} \pi_N(\omega) S(e_1,\pi_1,\omega_1,\ldots,e_{N-1},\pi_{N-1},\omega_{N-1},e_N,\pi_N,\omega) \le S(e_1,\pi_1,\omega_1,\ldots,e_{N-1},\pi_{N-1},\omega_{N-1}). \tag{6}$$

For $N = 1$, the argument of $S$ in the right-hand side is the empty sequence,
and we treat $S()$ as a real constant. The intuition behind the definition is the
following: there is a sequence of events $\omega_n$, each event is generated according to
its own distribution $\pi_n$ selected (or revealed) at each step anew; when the event
happens we compute the next value of $S$ depending on the outcomes of the
previous events, the previous distributions and some side information $e_n$; the
supermartingale property of $S$ means that the expectation of the next value
(when the distribution $\pi_n$ has been selected but the outcome is not known yet)
never exceeds the previous value of $S$.

Remark 2. The notion of a supermartingale is well known in probability
theory. Let $X_1, X_2, \ldots$ be a sequence of random elements with values in $\Omega$.
Denote by $x_n$ some realization of $X_n$, $n = 1, 2, \ldots$, and let $\pi_n$ be a conditional
distribution of $X_n$ given $X_1 = x_1, \ldots, X_{n-1} = x_{n-1}$. If we fix some values for
$e_n$ and substitute $X_n$ for $\omega_n$ in $S$, we can rewrite condition (6) as

$$\mathrm{E}\, S(x_1,\ldots,x_{N-1},X_N) \le S(x_1,\ldots,x_{N-1})$$

(the parameters $e_n$ and $\pi_n$ in $S$ are omitted). We get the usual definition
of a (probabilistic) supermartingale $S_N = S(X_1,\ldots,X_N)$, $N = 1, 2, \ldots$, with
respect to the sequence $X_1, X_2, \ldots$:

$$\mathrm{E}[S_N \mid X_1,\ldots,X_{N-1}] \le S_{N-1}.$$

In a sense, a game-theoretic supermartingale is a family of probabilistic
supermartingales parameterized by some $e_n$ and also by probability distributions $\pi_n$,
where the latter serve as conditional probabilities of the underlying random
process.

Remark 3. A reader familiar with supermartingales in algorithmic
probability theory may also find the following connection helpful. Let $\mu: \Omega^* \to [0,1]$
be a measure on $\Omega^\infty$ (where $\Omega^*$ and $\Omega^\infty$ are the sets of finite and infinite
sequences of elements from $\Omega$). As defined in e.g. [17, p. 296], a function
$s: \Omega^* \to \mathbb{R}_+$ is called a supermartingale with respect to $\mu$ if for any $N$ and
any $\omega_1,\ldots,\omega_{N-1} \in \Omega$ it holds that

$$\sum_{\omega\in\Omega} \mu(\omega \mid \omega_1,\ldots,\omega_{N-1})\, s(\omega_1,\ldots,\omega_{N-1},\omega) \le s(\omega_1,\ldots,\omega_{N-1}),$$

where $\mu(\omega \mid \omega_1,\ldots,\omega_{N-1}) = \mu(\omega_1,\ldots,\omega_{N-1},\omega)/\mu(\omega_1,\ldots,\omega_{N-1})$ (and $\mu(\omega_1 \ldots \omega_n)$ means the
measure of the set of all infinite sequences with the prefix $\omega_1 \ldots \omega_n$). Let $e_n$ be any
functions of $\omega_1,\ldots,\omega_{n-1}$. Let $\pi_n(\omega)$ be $\mu(\omega \mid \omega_1,\ldots,\omega_{n-1})$. Having
substituted these functions in any game-theoretic supermartingale $S$, we get a
supermartingale with respect to $\mu$ in the algorithmic sense.

A supermartingale $S$ is called forecast-continuous if for any $N$, for any
$e_1,\ldots,e_N \in E$, for any $\pi_1,\ldots,\pi_{N-1} \in \mathcal{P}(\Omega)$, for any $\omega_1,\ldots,\omega_{N-1},\omega_N \in \Omega$, the
function $S(e_1,\pi_1,\omega_1,\ldots,e_N,\pi,\omega_N)$ is continuous as a function of $\pi \in \mathcal{P}(\Omega)$.

The main use of forecast-continuous supermartingales in this paper is ex- 
plained by the following lemma. 

Lemma 4. Suppose that $S$ is a forecast-continuous supermartingale. Then
for any $N$, for any $e_1,\ldots,e_N \in E$, for any $\pi_1,\ldots,\pi_{N-1} \in \mathcal{P}(\Omega)$, for any
$\omega_1,\ldots,\omega_{N-1} \in \Omega$, it holds that

$$\exists\pi\in\mathcal{P}(\Omega) \;\; \forall\omega\in\Omega \quad S(e_1,\pi_1,\omega_1,\ldots,e_N,\pi,\omega) \le S(e_1,\pi_1,\omega_1,\ldots,e_{N-1},\pi_{N-1},\omega_{N-1}).$$

Note that the property provided by this lemma is similar to the condition (5),
where the role of $S$ with the first $N-1$ triples of the arguments is played by
$\sum_{\theta\in\Theta} P_0(\theta) Q_{N-1}(\theta)$, the role of $S(\ldots,e_N,\pi,\omega)$ (the left-hand side) is played by
$\sum_{\theta\in\Theta} P_0(\theta)\tilde{Q}_N(\theta)$, the variable $\pi$ corresponds to $\gamma_N$, and for $n = 1,\ldots,N-1$,
the parameters $\pi_n$ and $e_n$ are represented by $\gamma_n$ and the vector of $\gamma_n^\theta$, $\theta \in \Theta$,
respectively.

A variant of this lemma was originally proved by Levin [16] in the
context of the algorithmic theory of randomness. We will prove this lemma later (see
Lemma 8), and in the next subsection we consider the Defensive Forecasting
Algorithm, the main application of this lemma in our paper.



3.1 Defensive Forecasting 

The Defensive Forecasting Algorithm (DFA) is another strategy for Learner in
the game of prediction with expert advice. Let $(\Omega,\Lambda)$ be a game. The DFA has
five parameters: reals $c \ge 1$, $\eta > 0$, a (canonical) loss function $\lambda: \mathcal{P}(\Omega) \to \Sigma_\Lambda$, a
distribution $P_0$ on $\Theta$, and a substitution function $\sigma: \Sigma_\Lambda \to \Lambda$ such that $\sigma(\gamma) \le \gamma$
for all $\gamma \in \Sigma_\Lambda$.

Given $\lambda$, $c$ and $\eta$, let us define the following function on $(\Sigma_\Lambda \times \mathcal{P}(\Omega) \times \Omega)^*$:

$$Q(g_1,\pi_1,\omega_1,\ldots,g_N,\pi_N,\omega_N) = \exp\Bigl(\eta \sum_{n=1}^{N} \Bigl(\frac{\lambda(\pi_n,\omega_n)}{c} - g_n(\omega_n)\Bigr)\Bigr). \tag{7}$$

To simplify notation, here and in the sequel we consider $\lambda$ as a function from
$\mathcal{P}(\Omega) \times \Omega$ to $[0,\infty]$, that is, we write $\lambda(\pi,\omega)$ instead of $(\lambda(\pi))(\omega)$ and $\lambda(\pi,\cdot)$
instead of $\lambda(\pi)$. For $N = 0$, we let $Q() = 1$ in accordance with the usual
agreement that the empty sum equals 0. Note that $Q$ is
similar to $Q_N(\theta)$ from the proof of Theorem 1 with $g_n$ standing for $\gamma_n^\theta$ and
$\lambda(\pi_n,\cdot)$ standing for $\gamma_n$.
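Given also $P_0$, let us define the function $Q^{P_0}$ on $((\Sigma_\Lambda)^\Theta \times \mathcal{P}(\Omega) \times \Omega)^*$ as the
following weighted sum of $Q$:

$$Q^{P_0}(\{\gamma_1^\theta\}_{\theta\in\Theta},\pi_1,\omega_1,\ldots,\{\gamma_N^\theta\}_{\theta\in\Theta},\pi_N,\omega_N) = \sum_{\theta\in\Theta} P_0(\theta)\, Q(\gamma_1^\theta,\pi_1,\omega_1,\ldots,\gamma_N^\theta,\pi_N,\omega_N). \tag{8}$$

At step $N$, the DFA chooses any $\pi_N \in \mathcal{P}(\Omega)$ such that

$$\forall\omega\in\Omega \quad Q^{P_0}(\{\gamma_1^\theta\}_{\theta\in\Theta},\pi_1,\omega_1,\ldots,\{\gamma_N^\theta\}_{\theta\in\Theta},\pi_N,\omega) \le Q^{P_0}(\{\gamma_1^\theta\}_{\theta\in\Theta},\pi_1,\omega_1,\ldots,\{\gamma_{N-1}^\theta\}_{\theta\in\Theta},\pi_{N-1},\omega_{N-1}), \tag{9}$$

stores this $\pi_N$ for use at later steps, and announces $\gamma_N = \sigma(\lambda(\pi_N,\cdot))$ as
Learner's prediction.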

Assume that the function $Q$ defined by (7) is a forecast-continuous
supermartingale. Clearly, this implies that $Q^{P_0}$ defined by (8) is also a
forecast-continuous supermartingale for any $P_0$. Then Lemma 4 guarantees that the
DFA can choose $\pi_N$ with the required property.

Theorem 5. If $Q$ defined by (7) is a forecast-continuous supermartingale for
certain $c$, $\eta$, and $\lambda$ then the DFA with parameters $c$, $\eta$, $\lambda$, $P_0$, and $\sigma$ guarantees
that, at each step $N$ and for all experts $\theta$, it holds

$$L_N \le c L_N^\theta + \frac{c}{\eta}\ln\frac{1}{P_0(\theta)}.$$

Proof. The step of the DFA guarantees that at each step $N$ the value of $Q^{P_0}$
does not increase, independently of the outcome $\omega_N$. Thus, the value of $Q^{P_0}$ at
each step $N$ is not greater than its initial value, 1. Since $Q$ is always
non-negative and $Q^{P_0}$ as the sum of non-negative values can be bounded from below by
any of its terms, we get

$$P_0(\theta) \exp\Bigl(\eta \sum_{n=1}^{N} \Bigl(\frac{\lambda(\pi_n,\omega_n)}{c} - \gamma_n^\theta(\omega_n)\Bigr)\Bigr) \le 1,$$

and therefore

$$\sum_{n=1}^{N} \lambda(\pi_n,\omega_n) \le c L_N^\theta + \frac{c}{\eta}\ln\frac{1}{P_0(\theta)}.$$

It remains to recall that $\gamma_n = \sigma(\lambda(\pi_n,\cdot)) \le \lambda(\pi_n,\cdot)$, thus summing up we get
$L_N \le \sum_{n=1}^{N} \lambda(\pi_n,\omega_n)$. □
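The selection step of the DFA can be made concrete for binary outcomes. The following Python sketch is an illustration under stated assumptions, not the paper's general construction: it takes the square loss game with $c = 1$ and $\eta = 2$, identifies a distribution $\pi$ with the number $p = \pi(1)$ (the identification used in Subsection 3.2), takes the substitution function to be the identity, and looks for $p$ by checking the two endpoints and then bisecting on the difference between the two candidate next values of $Q^{P_0}$. All function names are hypothetical.

```python
import math

def square_loss(p, outcome):
    return (p - outcome) ** 2

def mixture_value(p, outcome, weights, expert_preds, eta):
    """Next value of Q^{P_0} if Learner plays p and Reality plays outcome (c = 1)."""
    return sum(w * math.exp(eta * (square_loss(p, outcome) - square_loss(q, outcome)))
               for w, q in zip(weights, expert_preds))

def dfa_step(weights, expert_preds, eta=2.0, iters=60):
    """Find p such that Q^{P_0} does not grow on either outcome."""
    def gap(p):
        return (mixture_value(p, 1, weights, expert_preds, eta)
                - mixture_value(p, 0, weights, expert_preds, eta))
    if gap(0.0) <= 0.0:      # at p = 0 the expectation is the value on outcome 0
        return 0.0
    if gap(1.0) >= 0.0:      # at p = 1 the expectation is the value on outcome 1
        return 1.0
    lo, hi = 0.0, 1.0        # invariant: gap(lo) > 0 > gap(hi)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if gap(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)   # both outcomes give (almost) the same value here

def run_dfa_square_loss(prior, expert_preds, outcomes, eta=2.0):
    """Returns (L_N, [L_N^theta], final value of Q^{P_0})."""
    weights = list(prior)    # P_0(theta) * Q_{N-1}(theta), initially P_0(theta)
    learner_loss, expert_loss = 0.0, [0.0] * len(prior)
    for preds, outcome in zip(expert_preds, outcomes):
        p = dfa_step(weights, preds, eta)
        learner_loss += square_loss(p, outcome)
        for i, q in enumerate(preds):
            expert_loss[i] += square_loss(q, outcome)
            weights[i] *= math.exp(eta * (square_loss(p, outcome) - square_loss(q, outcome)))
    return learner_loss, expert_loss, sum(weights)
```

At a root of the gap the two candidate values coincide, so the supermartingale property bounds both of them by the previous value; this is why bisection suffices in the binary case. For more than two outcomes a different search would be needed.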

In Subsections 3.3-3.6 we discuss general conditions when $Q$ defined by (7)
is a supermartingale. In the next subsection we begin with examples for two
widely used games of prediction.

3.2 Two Examples of Supermartingales 

The logarithmic loss game is defined by the loss function

$$\lambda^{\log}(p,\omega) := \begin{cases} -\ln p & \text{if } \omega = 1, \\ -\ln(1-p) & \text{if } \omega = 0, \end{cases}$$

where $\omega \in \{0,1\}$ is the outcome and $p \in [0,1]$ is the decision (notice that
the loss function is allowed to take value $\infty$). It is known [25, Example 5]
that this game is $\eta$-mixable for $\eta \in (0,1]$. The corresponding prediction set
is $\Lambda^{\log} = \{(x,y) \in [0,\infty]^2 \mid e^{-x} + e^{-y} = 1\}$. The losses in the game are
$L_N := \sum_{n=1}^{N} \lambda^{\log}(p_n,\omega_n)$ for Learner who predicts $p_n$ and
$L_N^\theta := \sum_{n=1}^{N} \lambda^{\log}(p_n^\theta,\omega_n)$ for Expert $\theta$ who predicts $p_n^\theta$.
Consider the following function:

$$\exp\Bigl(\eta \sum_{n=1}^{N} \bigl(\lambda^{\log}(p_n,\omega_n) - \lambda^{\log}(p_n^\theta,\omega_n)\bigr)\Bigr). \tag{10}$$

This function is actually $Q$ defined by (7), where $c = 1$ and $\lambda^{\log}(p_n^\theta,\cdot)$ stands for
$g_n$. The only difference is that $p_n$ is not an element of $\mathcal{P}(\Omega)$. To fix this, let us
assign Learner's decision $p \in [0,1]$ (and thereby prediction $(-\ln(1-p), -\ln p) \in \Lambda^{\log}$)
to each distribution $\pi = (1-p, p)$ on $\{0,1\}$. With this identification $\pi \mapsto p$,
the expression (10) specifies a function on $([0,1] \times \mathcal{P}(\{0,1\}) \times \{0,1\})^*$ with the
arguments $p_n^\theta$, $\pi_n$ (represented by $p_n = \pi_n(1)$) and $\omega_n$.

Lemma 6. For $\eta \in (0,1]$, the function (10) is a forecast-continuous
supermartingale.

Proof. The continuity is obvious. For the supermartingale property, it suffices
to check that

$$p_n e^{\eta(-\ln p_n + \ln p_n^\theta)} + (1-p_n) e^{\eta(-\ln(1-p_n) + \ln(1-p_n^\theta))} \le 1, \tag{11}$$

i.e., that $p_n^{1-\eta}(p_n^\theta)^{\eta} + (1-p_n)^{1-\eta}(1-p_n^\theta)^{\eta} \le 1$ for all $p_n, p_n^\theta, \eta \in [0,1]$. The last
inequality immediately follows from the generalized inequality between
arithmetic and geometric means: $u^\alpha v^{1-\alpha} \le \alpha u + (1-\alpha) v$ for any $u, v \ge 0$ and
$\alpha \in [0,1]$, which after taking the logarithm just expresses that the logarithm is
concave. (Remark: The left-hand side of (11) is a special case of what is known as
the Hellinger integral in probability theory.) □
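Inequality (11) is easy to probe numerically. The Python sketch below evaluates its left-hand side in the product form $p^{1-\eta}q^{\eta} + (1-p)^{1-\eta}(1-q)^{\eta}$ on a grid, writing $p$ for Learner's forecast and $q$ for the expert's. The function names and the grid resolution are arbitrary choices for illustration, and a grid check is no substitute for the proof.

```python
def log_loss_step_expectation(p, q, eta):
    """Left-hand side of (11): p^(1-eta) q^eta + (1-p)^(1-eta) (1-q)^eta."""
    return (p ** (1.0 - eta) * q ** eta
            + (1.0 - p) ** (1.0 - eta) * (1.0 - q) ** eta)

def max_log_loss_expectation(eta, steps=50):
    """Maximum of the left-hand side over a uniform grid in [0, 1]^2 (eta <= 1)."""
    grid = [i / steps for i in range(steps + 1)]
    return max(log_loss_step_expectation(p, q, eta) for p in grid for q in grid)

if __name__ == "__main__":
    for eta in (0.25, 0.5, 1.0):
        print(eta, max_log_loss_expectation(eta))
    # For eta > 1 the inequality can fail, e.g. at p = 0.1, q = 0.9:
    print(log_loss_step_expectation(0.1, 0.9, 1.5))
```

The grid helper is only meant for $\eta \le 1$; for $\eta > 1$ the expression involves negative powers and is undefined at the endpoints.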

In the square loss game, the outcomes are $\omega \in \{0,1\}$ and the decisions are
$p \in [0,1]$ as before, and the loss function is $\lambda^{\mathrm{sq}}(p,\omega) = (p-\omega)^2$. It is known [25,
Example 4] that this game is $\eta$-mixable for $\eta \in (0,2]$. The corresponding
prediction set is $\Lambda^{\mathrm{sq}} = \{(x,y) \in [0,1]^2 \mid \sqrt{x} + \sqrt{y} = 1\}$. The losses of Learner and
Expert $\theta$ are $L_N := \sum_{n=1}^{N} (p_n - \omega_n)^2$ and $L_N^\theta := \sum_{n=1}^{N} (p_n^\theta - \omega_n)^2$, respectively.
With the same identification $\pi \mapsto p$, the following expression specifies a function
on $([0,1] \times \mathcal{P}(\{0,1\}) \times \{0,1\})^*$:

$$\exp\Bigl(\eta \sum_{n=1}^{N} \bigl((p_n - \omega_n)^2 - (p_n^\theta - \omega_n)^2\bigr)\Bigr) \tag{12}$$

(again, note that it is a special case of $Q$ defined by (7)).

Lemma 7. For $\eta \in (0,2]$, the function (12) is a forecast-continuous
supermartingale.

Proof. It is sufficient to check that

$$p_n e^{\eta((p_n-1)^2 - (p_n^\theta-1)^2)} + (1-p_n) e^{\eta(p_n^2 - (p_n^\theta)^2)} \le 1$$

for all $p_n, p_n^\theta \in [0,1]$ and $\eta \in [0,2]$. To simplify notation, let us substitute $p$ for
$p_n$ and $p + x$ for $p_n^\theta$. Then after trivial transformations we get:

$$p e^{2\eta(1-p)x} + (1-p) e^{-2\eta p x} \le e^{\eta x^2}, \quad \forall x \in [-p, 1-p].$$

The last inequality is a simple corollary of the following well-known variant of
Hoeffding's inequality [Ul 4.16]:

$$\ln \mathrm{E}\, e^{sX} \le s\,\mathrm{E} X + \frac{s^2 (b-a)^2}{8},$$

which is true for any random variable $X$ taking values in $[a,b]$ and for any
$s \in \mathbb{R}$; see [5, Lemma A.1] for a proof. Indeed, applying the inequality to the
random variable $X$ that is equal to 1 with probability $p$ and to 0 with probability
$(1-p)$, we obtain $p\exp(s(1-p)) + (1-p)\exp(-sp) \le \exp(s^2/8)$. Substituting
$s := 2\eta x$, we have $p\exp(2\eta(1-p)x) + (1-p)\exp(-2\eta p x) \le \exp(\eta^2 x^2/2) \le
\exp(\eta x^2)$, the last inequality assuming $\eta \le 2$. □
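The role of the threshold $\eta \le 2$ in Lemma 7 can be seen numerically. The Python sketch below evaluates the one-step expectation $p\,e^{\eta((p-1)^2-(q-1)^2)} + (1-p)\,e^{\eta(p^2-q^2)}$, with $p$ standing for Learner's forecast and $q$ for the expert's; the function names and the grid are illustrative assumptions, and the check is not a proof.

```python
import math

def square_loss_step_expectation(p, q, eta):
    """Expected one-step factor of (12) under the distribution (1 - p, p)."""
    return (p * math.exp(eta * ((p - 1.0) ** 2 - (q - 1.0) ** 2))
            + (1.0 - p) * math.exp(eta * (p ** 2 - q ** 2)))

def max_square_loss_expectation(eta, steps=50):
    """Maximum of the expectation over a uniform grid in [0, 1]^2."""
    grid = [i / steps for i in range(steps + 1)]
    return max(square_loss_step_expectation(p, q, eta) for p in grid for q in grid)

if __name__ == "__main__":
    for eta in (0.5, 1.0, 2.0):
        print(eta, max_square_loss_expectation(eta))
    # Just above the threshold the property fails near p = 1/2:
    print(square_loss_step_expectation(0.5, 0.6, 2.5))
```

At $p = 1/2$ and $q = p + x$ the expectation equals $e^{-\eta x^2}\cosh(\eta x)$, which for small $x$ exceeds 1 exactly when $\eta > 2$; this is consistent with the last step of the proof.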

3.3 Supermartingales and the Realizability of the AA 

Our next goal is to find when $Q$ defined by (7) is a supermartingale, depending
on the parameters $c$, $\eta$ and $\lambda$. Loosely speaking, we will show that the AA is
$(c,\eta)$-realizable if and only if there exists $\lambda$ such that $Q$ is a supermartingale.
More precisely, the "only if" part holds for some class of games only. For
arbitrary games, the equivalence holds if we slightly relax the supermartingale
definition (see Theorem 22).

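Let us begin with some notation. For any functions $f: \Omega \to \mathbb{R}$ and $\pi: \Omega \to \mathbb{R}$
denote

$$\mathrm{E}_\pi f = \sum_{\omega\in\Omega} \pi(\omega) f(\omega).$$

Actually, this is the scalar product of $f$ and $\pi$ in $\mathbb{R}^\Omega$. We will mostly use this
for $\pi \in \mathcal{P}(\Omega)$; in this case $\mathrm{E}_\pi f$ can be interpreted as the expectation of $f$ over
distribution $\pi$. For functions $g \in [0,\infty]^\Omega$ and $\pi \in \mathcal{P}(\Omega)$, let

$$\mathrm{E}_\pi g = \sum_{\omega\in\Omega,\ \pi(\omega)\ne 0} \pi(\omega) g(\omega).$$

Recall that the function $Q$ defined by (7) is a supermartingale if

$$\mathrm{E}_\pi\bigl(Q(g_1,\pi_1,\omega_1,\ldots,g_N,\pi,\cdot) - Q(g_1,\pi_1,\omega_1,\ldots,g_{N-1},\pi_{N-1},\omega_{N-1})\bigr) \le 0$$

for any $g_1,\pi_1,\omega_1,\ldots,g_{N-1},\pi_{N-1},\omega_{N-1},g_N$ and $\pi$. The formula (7) can be
rewritten as $Q = \prod_{n=1}^{N} q_{g_n}(\pi_n,\omega_n)$, where the functions $q_g: \mathcal{P}(\Omega) \times \Omega \to [0,\infty]$
are defined by the formula

$$q_g(\pi,\omega) = \exp\Bigl(\eta\Bigl(\frac{\lambda(\pi,\omega)}{c} - g(\omega)\Bigr)\Bigr) \tag{13}$$

for any $g \in \Sigma_\Lambda$. Clearly, $Q$ is a supermartingale if and only if $\mathrm{E}_\pi q_g(\pi,\cdot) \le 1$ for
all $\pi \in \mathcal{P}(\Omega)$ and for all $g \in \Sigma_\Lambda$.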

Let us say that a function $q: \mathcal{P}(\Omega) \times \Omega \to \mathbb{R}$ has the supermartingale property
if for any $\pi \in \mathcal{P}(\Omega)$

$$\mathrm{E}_\pi q(\pi,\cdot) \le 1.$$






The function $q$ is forecast-continuous if for every $\omega \in \Omega$ it is continuous as a
function of $\pi$.

So, $Q$ defined by (7) is a forecast-continuous supermartingale if and only
if the functions $q_g$ defined by (13) are forecast-continuous and have the
supermartingale property for all $g \in \Sigma_\Lambda$. In the sequel, we will discuss the properties
of $q_g$ instead of $Q$. Let us begin with a variant of Lemma 4.

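Lemma 8. Let a function $q: \mathcal{P}(\Omega) \times \Omega \to \mathbb{R}$ be forecast-continuous. If for all
$\pi \in \mathcal{P}(\Omega)$ it holds that

$$\mathrm{E}_\pi q(\pi,\cdot) \le C,$$

where $C \in \mathbb{R}$ is some constant, then

$$\exists\pi\in\mathcal{P}(\Omega) \;\; \forall\omega\in\Omega \quad q(\pi,\omega) \le C.$$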

The proof of the lemma is given in the Appendix. Here let us illustrate the idea
behind the proof. Consider the function $\phi(\pi',\pi) = \mathrm{E}_{\pi'} q(\pi,\cdot)$ and assume that it
has the minimax property: $\min_\pi \max_{\pi'} \phi(\pi',\pi) = \max_{\pi'} \min_\pi \phi(\pi',\pi)$. Looking
at the right-hand side, note that $\min_\pi \phi(\pi',\pi) \le \phi(\pi',\pi') \le C$. Let $\pi$ minimize
the left-hand side, then we get $\max_{\pi'} \mathrm{E}_{\pi'} q(\pi,\cdot) \le C$, that is, $\mathrm{E}_{\pi'} q(\pi,\cdot) \le C$ for
any $\pi'$, which implies the statement of the lemma if we consider distributions
$\pi'$ concentrated at each $\omega$.

Note that Lemma 4 is a simple corollary of Lemma 8 applied to $C = 0$ and

$$q(\pi,\omega) = S(e_1,\pi_1,\omega_1,\ldots,e_N,\pi,\omega) - S(e_1,\pi_1,\omega_1,\ldots,e_{N-1},\pi_{N-1},\omega_{N-1}).$$

Now let us prove that if $q_g$ defined by (13) have the supermartingale property
for all $g \in \Sigma_\Lambda$ (in other words, $Q$ is a supermartingale) then the AA is realizable.

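Theorem 9. Let $\lambda$ map $\mathcal{P}(\Omega)$ to $\Sigma_\Lambda$, and let $c \ge 1$ and $\eta > 0$ be reals such that

$$q_g(\pi,\omega) = \exp\Bigl(\eta\Bigl(\frac{\lambda(\pi,\omega)}{c} - g(\omega)\Bigr)\Bigr)$$

are forecast-continuous and have the supermartingale property for all $g \in \Sigma_\Lambda$.
Then the AA is $(c,\eta)$-realizable.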

Proof. Recall that the $(c,\eta)$-realizability is equivalent to the inequality (3) for
any finite $G \subseteq \Sigma_\Lambda$ and for any distribution $p$ on $G$. Let us consider the following
function:

$$q(\pi,\omega) = \sum_{g\in G} p(g)\, q_g(\pi,\omega).$$

The function $q$ is forecast-continuous and has the supermartingale property as
a non-negative weighted sum of forecast-continuous functions with the
supermartingale property. By Lemma 8 applied to this $q$ and $C = 1$, there exists
$\pi \in \mathcal{P}(\Omega)$ such that $q(\pi,\omega) \le 1$ for all $\omega$, that is,

$$\sum_{g\in G} p(g) \exp\Bigl(\eta\Bigl(\frac{\lambda(\pi,\omega)}{c} - g(\omega)\Bigr)\Bigr) \le 1.$$

After trivial transformations, we get the inequality (3) with $\gamma(\omega)$ replaced by
$\lambda(\pi,\omega)$. It remains to note that $\lambda(\pi,\cdot) \in \Sigma_\Lambda$. □
3.4 Proper Loss Functions

The functions $q_g$ defined by (13) have a loss function $\lambda$ as a parameter. In this
subsection, we consider an important property of this loss function.

A function $\lambda: \mathcal{P}(\Omega) \times \Omega \to [0,\infty]$ is called a proper loss function if for all
$\pi, \pi' \in \mathcal{P}(\Omega)$

$$\mathrm{E}_\pi \lambda(\pi,\cdot) \le \mathrm{E}_\pi \lambda(\pi',\cdot),$$

and $\lambda$ is strictly proper if for all $\pi \ne \pi'$ the inequality is strict.

The intuition behind this definition is the following. Assume that the
outcome $\omega$ is generated according to some distribution $\pi$. Then the expected loss
$\mathrm{E}_\pi \lambda(\pi',\cdot)$ is minimal if the prediction $\pi'$ equals the true distribution.
Informally speaking, proper loss functions encourage a forecaster to announce the
true subjective probabilities. In a sense, if the loss function is proper then the
predictions have a real, not just notational, probabilistic meaning. The proper
loss functions are well known in the Bayesian context; see [7] and [12] (note that
these authors consider gains, or scores, instead of losses, so their notation differs
from ours by the sign).

We say that $\lambda$ is proper with respect to a set $X \subseteq [0,\infty]^\Omega$ if for all $\pi \in \mathcal{P}(\Omega)$,
it holds that $\lambda(\pi,\cdot) \in X$ and for all $g \in X$ it holds that

$$\mathrm{E}_\pi \lambda(\pi,\cdot) \le \mathrm{E}_\pi g$$

(in other words, $\lambda(\pi,\cdot) \in \arg\min_{g\in X} \mathrm{E}_\pi g$). If the inequality holds for a fixed $\pi$
and all $g \in X$, we will say that $\lambda$ is proper at $\pi$. Clearly, if $\lambda$ is proper with
respect to $X$ then $\lambda$ is proper in the usual sense. The definition has a simple
geometrical interpretation. The inequality means that the set $X$ lies on one
side of the hyperplane $\{x \in \mathbb{R}^\Omega \mid \sum_{\omega\in\Omega} \pi(\omega) x(\omega) = \mathrm{E}_\pi \lambda(\pi,\cdot)\}$, and $X$ touches
the hyperplane at $\lambda(\pi,\cdot) \in X$. That is, $\lambda(\pi,\cdot)$ is a point where $X$ touches the
supporting hyperplane with normal $\pi$.

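Lemma 10. Let $\lambda$ map $\mathcal{P}(\Omega)$ to $\Sigma_\Lambda$ and $\eta > 0$
be such that the functions

$$q_g(\pi,\omega) := e^{\eta(\lambda(\pi,\omega) - g(\omega))}$$

are forecast-continuous and have the supermartingale property for all
$g \in \Sigma_\Lambda$ (the functions $q_g$ are just (13) with $c = 1$). Then
$\lambda$ is a continuous proper loss function with respect to $\Sigma_\Lambda$.

Proof. The continuity is obvious. Since $e^x \ge 1 + x$ for all $x \in \mathbb{R}$, we get

$$\mathrm{E}_\pi e^{\eta(\lambda(\pi,\cdot) - g)} \ge \mathrm{E}_\pi\bigl(1 + \eta(\lambda(\pi,\cdot) - g)\bigr) = 1 + \eta\bigl(\mathrm{E}_\pi\lambda(\pi,\cdot) - \mathrm{E}_\pi g\bigr),$$

and from the supermartingale property we have
$\mathrm{E}_\pi\lambda(\pi,\cdot) \le \mathrm{E}_\pi g$ for all $g \in \Sigma_\Lambda$
and all $\pi \in \mathcal{P}(\Omega)$, since $\eta > 0$. (Remark: we get the strict
inequality $\mathrm{E}_\pi\lambda(\pi,\cdot) < \mathrm{E}_\pi g$ if
$\lambda(\pi,\omega_0) \ne g(\omega_0)$ and $\pi(\omega_0) \ne 0$ for some
$\omega_0 \in \Omega$.) □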
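As an illustration of the hypotheses of Lemma 10, the sketch below checks the supermartingale property numerically for the logarithmic loss $\lambda(\pi,\omega) = -\ln\pi(\omega)$ with $\eta = 1$ and $c = 1$. The choice of four outcomes, the random sampling, and the way superpredictions are generated (an expert's log-loss vector plus a non-negative slack) are assumptions made only for this example.

```python
import numpy as np

def q_log(pi, g, eta=1.0):
    """q_g(pi, .) = exp(eta * (lambda(pi, .) - g)) for the log loss lambda = -ln pi."""
    return np.exp(eta * (-np.log(pi) - g))

rng = np.random.default_rng(1)
worst_expectation = 0.0
worst_properness_gap = -np.inf
for _ in range(1000):
    pi = rng.dirichlet(np.ones(4))          # Learner's distribution
    expert = rng.dirichlet(np.ones(4))      # hypothetical expert distribution
    # A superprediction: the expert's loss vector plus a non-negative slack.
    slack = rng.exponential(size=4) * rng.integers(0, 2, size=4)
    g = -np.log(expert) + slack
    # Supermartingale property: E_pi q_g(pi, .) <= 1.
    worst_expectation = max(worst_expectation, float(np.dot(pi, q_log(pi, g))))
    # Conclusion of the lemma: E_pi lambda(pi, .) <= E_pi g.
    gap = float(np.dot(pi, -np.log(pi)) - np.dot(pi, g))
    worst_properness_gap = max(worst_properness_gap, gap)

print(worst_expectation, worst_properness_gap)
```

Here the expectation equals $\sum_\omega \pi'(\omega)e^{-\mathrm{slack}(\omega)}$ for the expert distribution $\pi'$, so it cannot exceed 1 up to rounding.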

From Theorem 5 we know that the conditions of the last lemma imply also
that the game $(\Omega,\Lambda)$ is $\eta$-mixable. Let us show that the converse
statement holds, i.e., the properness of $\lambda$ and mixability are sufficient
for the supermartingale property.
Lemma 11. Suppose that the game $(\Omega,\Lambda)$ is $\eta$-mixable and
$\lambda\colon \mathcal{P}(\Omega) \to \Sigma_\Lambda$ is a proper loss function
with respect to $\Sigma_\Lambda$. Then the functions

$$q_g(\pi,\omega) = e^{\eta(\lambda(\pi,\omega) - g(\omega))}$$

have the supermartingale property for every $g \in \Sigma_\Lambda$. If $\lambda$
is continuous then $q_g$ are forecast-continuous.

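Proof. The forecast-continuity is obvious. Assume that the supermartingale
property does not hold, in other words, that
$\mathrm{E}_\pi e^{\eta(\lambda(\pi,\cdot) - g)} = 1 + \delta$ for some
$\pi \in \mathcal{P}(\Omega)$, $g \in \Sigma_\Lambda$ and $\delta > 0$. For any
$\varepsilon > 0$ consider the point

$$g_\varepsilon = -\frac{1}{\eta}\ln\bigl((1-\varepsilon)e^{-\eta\lambda(\pi,\cdot)} + \varepsilon e^{-\eta g}\bigr).$$

The point $g_\varepsilon$ belongs to $\Sigma^\eta_\Lambda$ by the definition of
$\Sigma^\eta_\Lambda$, and $\Sigma^\eta_\Lambda = \Sigma_\Lambda$ since the game is
$\eta$-mixable, that is, $g_\varepsilon \in \Sigma_\Lambda$ for any $\varepsilon > 0$.
When $\varepsilon \to 0$, we have

$$g_\varepsilon = \lambda(\pi,\cdot) - \frac{1}{\eta}\ln\Bigl(1 + \varepsilon\bigl(e^{\eta(\lambda(\pi,\cdot) - g)} - 1\bigr)\Bigr)
= \lambda(\pi,\cdot) - \frac{\varepsilon}{\eta}\bigl(e^{\eta(\lambda(\pi,\cdot) - g)} - 1\bigr) + O(\varepsilon^2).$$

Taking the expectation $\mathrm{E}_\pi$, we get

$$\mathrm{E}_\pi g_\varepsilon = \mathrm{E}_\pi\lambda(\pi,\cdot) - \frac{\varepsilon}{\eta}\,\mathrm{E}_\pi\bigl(e^{\eta(\lambda(\pi,\cdot) - g)} - 1\bigr) + O(\varepsilon^2)
= \mathrm{E}_\pi\lambda(\pi,\cdot) - \frac{\delta}{\eta}\,\varepsilon + O(\varepsilon^2),$$

where $\delta > 0$ by our assumption. If $\varepsilon$ is sufficiently small then
$(\delta/\eta)\varepsilon > O(\varepsilon^2)$ and
$\mathrm{E}_\pi g_\varepsilon < \mathrm{E}_\pi\lambda(\pi,\cdot)$, which is impossible
since $\lambda(\pi,\cdot)$ is proper with respect to $\Sigma_\Lambda$. □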

An alternative, more geometrical proof of the last lemma for binary games
can be found in [5, Lemma 3].

3.5 The Realizability of the AA and Supermartingales 

Theorem 9 shows that if the functions $q_g$ defined by (13) are forecast-continuous
and have the supermartingale property then the AA is realizable. We want to 
show the converse, that if the AA is realizable then one can find A such that 
the functions q g are forecast-continuous and have the supermartingale property. 
For mixable games, we know already that a proper loss function works (though 
we do not know yet whether a proper loss function exists) . In this subsection we 
show that we can obtain A in any game if we can construct continuous proper 
loss functions for mixable games. How to do the latter and when it is possible 
is discussed in the next subsection. 

To state and prove the main result of this subsection, we need two standard
assumptions (see [25]) about the game $(\Omega,\Lambda)$ and some additional notation.

Assumption 1. $\Lambda$ is a compact subset of $[0,\infty]^\Omega$ (in the extended topology).

Assumption 2. There exists $g_{\mathrm{fin}} \in \Lambda$ such that
$g_{\mathrm{fin}}(\omega) < \infty$ for all $\omega \in \Omega$.
Note that if $\Lambda$ is compact then $\Sigma_\Lambda$ is also compact, as well as
$\Sigma^\eta_\Lambda$. A nice feature of compact prediction sets is that the
properties of the game are determined by the boundary of the prediction set.

For any set $X \subseteq [0,\infty]^\Omega$, by $\mathcal{M}X$ denote the set of
minimal elements of $X$: $g_0 \in \mathcal{M}X$ if and only if for any $g \in X$
the inequality $g_0 \ge g$ implies $g_0 = g$. For a compact set $X$, for every
$g \in X$ there is an element $g_0 \in \mathcal{M}X$ such that $g_0 \le g$; that is,
$X \subseteq (\mathcal{M}X + [0,\infty]^\Omega)$. Notice that $\mathcal{M}X$ is
contained in the boundary $\partial X$ of $X$.

Since $\Sigma_\Lambda = \Sigma_\Lambda + [0,\infty]^\Omega = \Lambda + [0,\infty]^\Omega$,
we have $\mathcal{M}\Sigma_\Lambda = \mathcal{M}\Lambda \subseteq \Lambda$. For
compact $\Lambda$, we have
$\Sigma_\Lambda = \mathcal{M}\Sigma_\Lambda + [0,\infty]^\Omega = \Sigma_{\mathcal{M}\Lambda} = \mathcal{M}\Lambda + [0,\infty]^\Omega$.
Note also that a game is $\eta$-mixable if and only if
$\mathcal{M}\Sigma^\eta_\Lambda \subseteq \Lambda$, since this is equivalent to
$\Sigma^\eta_\Lambda = \Sigma_\Lambda$. A loss function is proper with respect to
$\Sigma^\eta_\Lambda$ if and only if it is proper with respect to
$\mathcal{M}\Sigma^\eta_\Lambda$.

Lemma 12. Suppose that the game $(\Omega,\Lambda)$ satisfies Assumptions 1 and 2 and
the AA is $(c,\eta)$-realizable for this game. Then there is a continuous mapping
$V\colon \Sigma^\eta_\Lambda \to \partial\Sigma_\Lambda$ such that $V(g) \le cg$ for
all $g \in \Sigma^\eta_\Lambda$.

The proof is given in Appendix. The mapping $V$ is actually the central
projection from $\Sigma^\eta_\Lambda$ into the superprediction set $\Sigma_\Lambda$
(which contains $c\Sigma^\eta_\Lambda$ when the AA is $(c,\eta)$-realizable).

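Theorem 13. Let the game $(\Omega,\Lambda)$ satisfy Assumptions 1 and 2, the AA be
$(c,\eta)$-realizable for this game, and
$\lambda^\eta\colon \mathcal{P}(\Omega) \to \Sigma^\eta_\Lambda$ be a continuous
proper loss function with respect to $\Sigma^\eta_\Lambda$. Then for any continuous
$\lambda\colon \mathcal{P}(\Omega) \to \partial\Sigma_\Lambda$ such that
$\lambda(\pi,\cdot) \le c\lambda^\eta(\pi,\cdot)$ for all $\pi \in \mathcal{P}(\Omega)$,
the functions $q_g$ defined by (13) are forecast-continuous and have the
supermartingale property for every $g \in \Sigma_\Lambda$; and there exists a
continuous $\lambda\colon \mathcal{P}(\Omega) \to \partial\Sigma_\Lambda$ such that
$\lambda(\pi,\cdot) \le c\lambda^\eta(\pi,\cdot)$ for all $\pi \in \mathcal{P}(\Omega)$.

Proof. The forecast-continuity is obvious. Let us check the supermartingale
property, i.e., that

$$\mathrm{E}_\pi e^{\eta(\lambda(\pi,\cdot)/c - g)} \le 1$$

for all $\pi \in \mathcal{P}(\Omega)$ and all $g \in \Sigma_\Lambda$. Since
$\lambda(\pi,\cdot) \le c\lambda^\eta(\pi,\cdot)$, it suffices that

$$\mathrm{E}_\pi e^{\eta(\lambda^\eta(\pi,\cdot) - g)} \le 1,$$

which follows from Lemma 11 applied to the $\eta$-mixable game
$(\Omega,\Sigma^\eta_\Lambda)$ and the proper function $\lambda^\eta$ (note that
$\Sigma_\Lambda \subseteq \Sigma^\eta_\Lambda$, hence the lemma works for all
$g \in \Sigma_\Lambda$).

It remains to observe that $\lambda(\pi,\cdot) = V(\lambda^\eta(\pi,\cdot))$, where
$V$ is defined in Lemma 12, has the properties we need. □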

3.6 Construction of a Continuous Proper Loss Function 

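In this subsection, we fix a game $(\Omega,\Lambda)$, fix $\eta > 0$, and consider
proper loss functions with respect to $\Sigma^\eta_\Lambda$. They can be interpreted
also as proper loss functions for the $\eta$-mixable game $(\Omega,\Sigma^\eta_\Lambda)$.

Lemma 14. Let $\lambda_1$ and $\lambda_2$ be functions from $\mathcal{P}(\Omega)$ to
$\Sigma^\eta_\Lambda$. Suppose that they are proper with respect to
$\Sigma^\eta_\Lambda$ at some point $\pi \in \mathcal{P}(\Omega)$, that is,
$\mathrm{E}_\pi\lambda_i(\pi,\cdot) \le \mathrm{E}_\pi g$, $i = 1,2$, for all
$g \in \Sigma^\eta_\Lambda$. Then for all $\omega \in \Omega$ we have

$$\pi(\omega) \ne 0 \;\Longrightarrow\; \lambda_1(\pi,\omega) = \lambda_2(\pi,\omega).$$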
The proof of the lemma is given in Appendix.

Let $\mathcal{P}^\circ(\Omega)$ be the set of all non-degenerate distributions, i.e.

$$\mathcal{P}^\circ(\Omega) = \{\pi \in \mathcal{P}(\Omega) \mid \forall\omega\in\Omega\ \ \pi(\omega) > 0\}.$$

Lemma 14 implies that a proper loss function is uniquely defined on
$\mathcal{P}^\circ(\Omega)$. The following lemma gives a more explicit
specification of the values of a proper loss function on $\mathcal{P}^\circ(\Omega)$.

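Lemma 15. Let the game $(\Omega,\Lambda)$ satisfy Assumptions 1 and 2. Define the
function $H\colon \mathbb{R}^\Omega \to [-\infty,\infty)$ by the formula

$$H(\pi) = \min_{g\in\Sigma^\eta_\Lambda} \mathrm{E}_\pi g . \qquad (14)$$

Let $\mathcal{H}$ be the domain where $H$ is differentiable. Then
$\mathcal{H} \supseteq \mathcal{P}^\circ(\Omega)$, and the components of the
gradient of $H$ at $\pi \in \mathcal{H}\cap\mathcal{P}(\Omega)$ constitute a
continuous function $\lambda\colon \mathcal{H}\cap\mathcal{P}(\Omega) \to \Sigma^\eta_\Lambda$
such that $\mathrm{E}_\pi\lambda(\pi,\cdot) = H(\pi)$. Moreover, if
$\pi \in \mathcal{P}^\circ(\Omega)$ then $\lambda(\pi,\cdot)$ is the unique point
where the minimum in (14) is attained.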

Remark 16. The function $H(\pi)$ for $\pi \in \mathcal{P}(\Omega)$ is known as the
generalized entropy of the game $(\Omega,\Lambda)$; see [13]. For the logarithmic
loss game, $H(\pi)$ becomes the Shannon entropy of $\pi$ (cf. (16)). It is worth
mentioning that one can reconstruct the superprediction set $\Sigma_\Lambda$ from
the generalized entropy of the
game, and also from the predictive complexity of the game (see [TB] for the 
definitions and proofs in the case of binary games) . 
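The generalized entropy can be approximated by brute force. The sketch below is only an illustration under stated assumptions: it minimizes the expected loss over predictions on a finite grid of the simplex (rather than over the whole set $\Sigma^\eta_\Lambda$), uses three outcomes, and picks a distribution `pi` that lies on the grid. For the Brier game the result should match $1 - \sum_\omega \pi(\omega)^2$, and for the logarithmic loss game it should match the Shannon entropy.

```python
import itertools
import numpy as np

def simplex_grid(n_outcomes, steps):
    """Yield all probability vectors whose entries are multiples of 1/steps."""
    for combo in itertools.product(range(steps + 1), repeat=n_outcomes - 1):
        if sum(combo) <= steps:
            yield np.array(list(combo) + [steps - sum(combo)]) / steps

def brier_vector(pred):
    """Loss vector of the Brier game: 1 - 2 pred(w) + sum_o pred(o)^2 for each w."""
    return 1.0 - 2.0 * pred + np.sum(pred ** 2)

def log_vector(pred):
    """Loss vector of the log loss game; infinite where pred(w) = 0."""
    with np.errstate(divide="ignore"):
        return -np.log(pred)

def entropy_by_search(loss_vector, pi, steps=100):
    """Minimum over grid predictions of the expected loss under pi."""
    return min(float(np.dot(pi, loss_vector(pred)))
               for pred in simplex_grid(len(pi), steps))

pi = np.array([0.5, 0.3, 0.2])   # strictly positive, and a grid point for steps=100
brier_entropy = entropy_by_search(brier_vector, pi)
log_entropy = entropy_by_search(log_vector, pi)
shannon = float(-np.sum(pi * np.log(pi)))
print(brier_entropy, 1.0 - float(np.sum(pi ** 2)))
print(log_entropy, shannon)
```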

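The proof of the lemma is given in Appendix. The proof is based on the
fact that the function $-H(\pi)$ is convex. Note that
$\lambda(\pi,\cdot) \in \mathcal{M}\Sigma^\eta_\Lambda$ for any
$\pi \in \mathcal{P}^\circ(\Omega)$. Indeed, if for some
$\pi \in \mathcal{P}^\circ(\Omega)$ we have
$\lambda(\pi,\cdot) \notin \mathcal{M}\Sigma^\eta_\Lambda$ then there exists
$g \le \lambda(\pi,\cdot)$, $g \in \Sigma^\eta_\Lambda$, with
$g(\omega) < \lambda(\pi,\omega)$ for at least one $\omega$. Since $\pi(\omega) > 0$,
we get $\mathrm{E}_\pi g < \mathrm{E}_\pi\lambda(\pi,\cdot) = H(\pi)$, which
contradicts the definition of $H$.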

Recall that if a loss function $\lambda$ is proper with respect to
$\Sigma^\eta_\Lambda$ then $\mathrm{E}_\pi\lambda(\pi,\cdot) = H(\pi)$. Lemma 15
shows that on $\mathcal{P}^\circ(\Omega)$ a proper loss function $\lambda$ exists
and it is unique and continuous. Our next task is to extend $\lambda$ continuously
from $\mathcal{P}^\circ(\Omega)$ to $\mathcal{P}(\Omega)$. Unfortunately, this is
sometimes impossible. Consider an example.

Let $\Omega = \{1,2,3\}$, and let the prediction set be

$$\Lambda = \{(-\ln p,\ -\ln(1-p),\ 1) \mid p \in [0,1]\}.$$

Actually, this is the binary logarithmic loss game with an additional dummy
outcome. This game is 1-mixable and $\Sigma^1_\Lambda = \Sigma_\Lambda$. It is easy
to check that the proper loss function with respect to $\Sigma_\Lambda$ is given on
$\mathcal{P}^\circ(\Omega)$ by the formulas
$\lambda(\pi,i) = -\ln\frac{\pi(i)}{\pi(1)+\pi(2)}$, $i = 1,2$, and
$\lambda(\pi,3) = 1$. This function can be extended continuously to all $\pi$ such
that $\pi(1)+\pi(2) \ne 0$, so we have $\lambda(\pi,\cdot) = (\infty,0,1)$ if
$\pi(1) = 0$ and $\lambda(\pi,\cdot) = (0,\infty,1)$ if $\pi(2) = 0$. However,
these continuations are inconsistent at the point $\pi = (0,0,1)$. Therefore,
there is no continuous function on $\mathcal{P}(\Omega)$ which is proper with
respect to $\Sigma_\Lambda$ for this game.
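The inconsistency at $(0,0,1)$ can be seen numerically. The following sketch, with illustrative function names that are not from the paper, evaluates the first component $-\ln\frac{\pi(1)}{\pi(1)+\pi(2)}$ of the proper loss along two paths that both approach $(0,0,1)$ but keep different ratios $\pi(1) : \pi(2)$.

```python
import math

def proper_loss_dummy(pi):
    """Proper loss on non-degenerate pi for the binary log loss game with a dummy outcome."""
    mass = pi[0] + pi[1]
    return (-math.log(pi[0] / mass), -math.log(pi[1] / mass), 1.0)

def approach(ratio, eps):
    """A non-degenerate distribution within eps of (0, 0, 1) with pi(1)/(pi(1)+pi(2)) = ratio."""
    return (ratio * eps, (1.0 - ratio) * eps, 1.0 - eps)

limits = {}
for ratio in (0.5, 0.9):
    values = [proper_loss_dummy(approach(ratio, eps))[0] for eps in (1e-2, 1e-4, 1e-8)]
    limits[ratio] = values[-1]
    print("ratio", ratio, "first loss component along the path:", values)
```

Along each path the first component stays constant at $-\ln(\text{ratio})$, so the two paths have different limits and no single value at $(0,0,1)$ makes the function continuous.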

Now let us consider three examples of games where a continuous proper (and 
even strictly proper) loss function exists. 

The first example is the Brier game (see [21]), which is a generalization of 
the square loss game: 

$$\lambda^B(\pi,\omega) = \sum_{o\in\Omega}\bigl(\delta_\omega(o) - \pi(o)\bigr)^2,$$

where $\delta_\omega(o) = 1$ if $o = \omega$ and $\delta_\omega(o) = 0$ if
$o \ne \omega$. For the binary game $\Omega = \{0,1\}$, a distribution
$\pi \in \mathcal{P}(\Omega)$ is a pair $(1-p, p)$ where $p \in [0,1]$, and hence
$\lambda^B(\pi,\omega) = 2(p-\omega)^2$, which is twice the loss
$\lambda^{sq}(p,\omega) = (p-\omega)^2$ in the binary square loss game as defined
in Subsection 3.2.

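The Brier game is 1-mixable, that is, $\Sigma^\eta_{\Lambda^B} = \Sigma_{\Lambda^B}$
for $\eta \le 1$. Let us calculate $H(\pi)$ defined by (14) for
$\pi \in \mathcal{P}(\Omega)$:

$$H^B(\pi) = \min_{g\in\Sigma^\eta_{\Lambda^B}} \mathrm{E}_\pi g
= \min_{g\in\Sigma_{\Lambda^B}} \mathrm{E}_\pi g
= \min_{\pi'\in\mathcal{P}(\Omega)} \mathrm{E}_\pi\lambda^B(\pi',\cdot)$$
$$= \min_{\pi'\in\mathcal{P}(\Omega)} \sum_{\omega\in\Omega}\pi(\omega)\sum_{o\in\Omega}\bigl(\delta_\omega(o) - \pi'(o)\bigr)^2$$
$$= 1 - \sum_{\omega\in\Omega}\pi(\omega)^2 + \min_{\pi'\in\mathcal{P}(\Omega)}\sum_{\omega\in\Omega}\bigl(\pi(\omega) - \pi'(\omega)\bigr)^2
= 1 - \sum_{\omega\in\Omega}\pi(\omega)^2 .$$

Clearly, $H^B(\pi)$ is differentiable on $\mathcal{P}(\Omega)$, hence a continuous
proper loss function for the Brier game can be computed as the gradient of $H^B$
by Lemma 15. However, it is easier to note that the minimum of
$\mathrm{E}_\pi\lambda^B(\pi',\cdot)$ is attained at $\pi' = \pi$ only, and thus
the standard form of the loss function $\lambda^B$ is proper.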

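Remark 17. Note that in the example above we computed the value of $H(\pi)$
assuming that $\pi \in \mathcal{P}(\Omega)$. If we want to compute
$\lambda(\pi,\omega)$ as the partial derivatives of $H(\pi)$ with respect to
$\pi(\omega)$, we must consider $H(\pi)$ as a function on $\mathbb{R}^\Omega$ (as
stated in Lemma 15). To this end, just note that $H$ is homogeneous:

$$H(\pi) = H\Bigl(\frac{\pi}{\sum_{\omega\in\Omega}\pi(\omega)}\Bigr)\sum_{\omega\in\Omega}\pi(\omega) \qquad (15)$$

for $\pi \in \mathbb{R}^\Omega$. In the Brier game example we have

$$H^B(\pi) = \Bigl(1 - \frac{\sum_{\omega\in\Omega}\pi(\omega)^2}{\bigl(\sum_{\omega\in\Omega}\pi(\omega)\bigr)^2}\Bigr)\sum_{\omega\in\Omega}\pi(\omega),$$

and the partial derivatives are

$$1 - 2\pi(\omega) + \sum_{o\in\Omega}\pi(o)^2 = \lambda^B(\pi,\omega)$$

for any $\pi \in \mathcal{P}(\Omega)$. In general, if we have a function
$\phi\colon \mathbb{R}^\Omega \to \mathbb{R}$ such that $\phi(\pi) = H(\pi)$ for all
$\pi \in \mathcal{P}(\Omega)$, taking the derivatives of (15) we get that the
proper loss function $\lambda$ can be computed by the following formula for any
$\pi \in \mathcal{P}(\Omega)$:

$$\lambda(\pi,\omega) = \phi(\pi) - \sum_{o\in\Omega}\pi(o)\,\phi'_o(\pi) + \phi'_\omega(\pi),$$

where $\phi'_\omega$ is the partial derivative of $\phi$ with respect to
$\pi(\omega)$. This formula is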
known from the Savage theorem [5D] (see also [T^l Theorem 3.2]; recall that they 
consider scores, or gains, —A instead of losses A). 
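The Savage-type formula $\lambda(\pi,\omega) = \phi(\pi) - \sum_o \pi(o)\phi'_o(\pi) + \phi'_\omega(\pi)$ can be checked on the two entropies computed in this subsection. The sketch below is a hedged illustration: the helper `savage_loss` and the hand-written gradients are assumptions of this example, with $\phi = 1 - \sum\pi^2$ for the Brier game and $\phi = -\sum\pi\ln\pi$ for the logarithmic loss game.

```python
import numpy as np

def savage_loss(phi, grad_phi, pi):
    """Loss vector lambda(pi, .) = phi(pi) - sum_o pi(o) phi'_o(pi) + phi'_.(pi)."""
    grad = grad_phi(pi)
    return phi(pi) - np.dot(pi, grad) + grad

def brier_phi(pi):
    return 1.0 - np.sum(pi ** 2)

def brier_grad(pi):
    return -2.0 * pi

def shannon_phi(pi):
    return -np.sum(pi * np.log(pi))

def shannon_grad(pi):
    return -np.log(pi) - 1.0

pi = np.array([0.5, 0.3, 0.2])
brier_from_entropy = savage_loss(brier_phi, brier_grad, pi)
log_from_entropy = savage_loss(shannon_phi, shannon_grad, pi)

print(brier_from_entropy, 1.0 - 2.0 * pi + np.sum(pi ** 2))   # Brier loss vector
print(log_from_entropy, -np.log(pi))                          # log loss vector
```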

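The second example is the Hellinger game:

$$\lambda^H(\pi,\omega) = \frac{1}{2}\sum_{o\in\Omega}\Bigl(\sqrt{\delta_\omega(o)} - \sqrt{\pi(o)}\Bigr)^2 = 1 - \sqrt{\pi(\omega)} .$$

Similarly to the Brier game, we can find that

$$H^H(\pi) = \min_{\pi'\in\mathcal{P}(\Omega)}\sum_{\omega\in\Omega}\pi(\omega)\Bigl(1 - \sqrt{\pi'(\omega)}\Bigr) = 1 - \sqrt{\sum_{\omega\in\Omega}\pi(\omega)^2} .$$

Here the minimum is not attained at $\pi' = \pi$ and $\lambda^H$ is not proper.
Taking the derivatives, we find a proper loss function for the Hellinger game:

$$\lambda(\pi,\omega) = 1 - \frac{\pi(\omega)}{\sqrt{\sum_{o\in\Omega}\pi(o)^2}} .$$

This loss function is known as the spherical loss.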
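A small numerical sketch may help to see why the Hellinger loss $1 - \sqrt{\pi(\omega)}$ is not proper while the spherical loss $1 - \pi(\omega)/\sqrt{\sum_o \pi(o)^2}$ is. The distribution `pi`, the random candidates, and the closed-form best response $\pi' \propto \pi^2$ (obtained from the Cauchy-Schwarz inequality) are assumptions of this illustration rather than statements taken from the text.

```python
import numpy as np

def hellinger_vector(pred):
    """Hellinger loss vector 1 - sqrt(pred(w))."""
    return 1.0 - np.sqrt(pred)

def spherical_vector(pred):
    """Spherical loss vector 1 - pred(w) / ||pred||_2."""
    return 1.0 - pred / np.sqrt(np.sum(pred ** 2))

pi = np.array([0.6, 0.3, 0.1])
rng = np.random.default_rng(2)
candidates = rng.dirichlet(np.ones(3), size=5000)

entropy = 1.0 - np.sqrt(np.sum(pi ** 2))        # 1 - ||pi||_2
best_response = pi ** 2 / np.sum(pi ** 2)       # minimizer of the expected Hellinger loss

honest_hellinger = float(np.dot(pi, hellinger_vector(pi)))
optimal_hellinger = float(np.dot(pi, hellinger_vector(best_response)))
honest_spherical = float(np.dot(pi, spherical_vector(pi)))
best_spherical = min(float(np.dot(pi, spherical_vector(c))) for c in candidates)

print(honest_hellinger, optimal_hellinger, entropy)
print(honest_spherical, best_spherical)
```

The spherical loss vector at $\pi$ coincides with the Hellinger loss vector at the best response, which is one way to read the remark that the two losses specify the same game under different parameterizations.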

The spherical loss and the Hellinger loss specify the same game but under
different parameterization. For binary games, this kind of "reparameterization"
was considered in [14, Section 3.1], where a proper function $\lambda(\pi,\cdot)$
was called a Bayes-optimal prediction for bias $\pi$. More precisely, the paper
[14] discusses binary games specified by a loss function $\lambda(\gamma,\omega)$,
where $\omega$ is 0 or 1 and $\gamma \in [0,1]$. Their Lemma 3.5 states conditions
(on derivatives of $\lambda$ as a function of $\gamma$) when there exists a unique
$\gamma_p$ that minimizes $(1-p)\lambda(\gamma,0) + p\lambda(\gamma,1)$ for each
$p \in [0,1]$. This $\gamma_p$ can be obtained from Equation (3.8) in [14]:

$$(1-p)\,\frac{\partial}{\partial\gamma}\lambda(\gamma,0)\Big|_{\gamma=\gamma_p} + p\,\frac{\partial}{\partial\gamma}\lambda(\gamma,1)\Big|_{\gamma=\gamma_p} = 0 .$$

Our Lemma 15 can be regarded as a generalization of this approach.

Our third example is the general logarithmic loss game defined by

$$\lambda^{\log}(\pi,\omega) = -\ln\pi(\omega) .$$

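Similarly to the Brier loss function, the logarithmic loss function is strictly
proper. Indeed, let us calculate the entropy $H^{\log}$ for $\pi \in \mathcal{P}(\Omega)$:

$$H^{\log}(\pi) = \min_{\pi'\in\mathcal{P}(\Omega)}\sum_{\omega\in\Omega}\pi(\omega)\bigl(-\ln\pi'(\omega)\bigr)$$
$$= -\sum_{\omega\in\Omega}\pi(\omega)\ln\pi(\omega) - \max_{\pi'\in\mathcal{P}(\Omega)}\sum_{\omega\in\Omega}\pi(\omega)\ln\frac{\pi'(\omega)}{\pi(\omega)}
= -\sum_{\omega\in\Omega}\pi(\omega)\ln\pi(\omega) . \qquad (16)$$

Here the partial derivatives are infinite at the boundary of $\mathcal{P}(\Omega)$.
Nevertheless, it is easy to check that the minimum in the definition of
$H^{\log}(\pi)$ is always attained at one point $\pi' = \pi$ only. The last
equality in (16) holds since the logarithm is concave:
$\sum_{\omega\in\Omega}\pi(\omega)\ln\frac{\pi'(\omega)}{\pi(\omega)} \le \ln\Bigl(\sum_{\omega\in\Omega}\pi(\omega)\frac{\pi'(\omega)}{\pi(\omega)}\Bigr) = 0$,
and the inequality is strict unless $\pi'(\omega)/\pi(\omega)$ are equal for all
$\omega \in \Omega$ or $\pi(\omega_0) = 1$ for some $\omega_0$. In the former case,
$\pi = \pi'$, since $\pi,\pi' \in \mathcal{P}(\Omega)$. In the latter case, we get
$\max_{\pi'\in\mathcal{P}(\Omega)}\ln\pi'(\omega_0)$, which is attained if
$\pi'(\omega_0) = 1$, and hence $\pi = \pi'$ too.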

Now we consider a general way to construct proper loss functions, even in
the case when $H$ is not differentiable on all of $\mathcal{P}(\Omega)$. Note that
the only way to extend $\lambda$ continuously is to define it on
$\mathcal{P}(\Omega)\setminus\mathcal{P}^\circ(\Omega)$ as a limit from
$\mathcal{P}^\circ(\Omega)$, where $\lambda(\pi,\cdot)$ is defined as a point of
minimum. The following lemma, proved in Appendix, states that a limit of such
points is again a point of minimum.
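Lemma 18. Let $\pi_i \in \mathcal{P}(\Omega)$ and
$\gamma_i \in \mathcal{M}\Sigma^\eta_\Lambda$ be such that
$\mathrm{E}_{\pi_i}\gamma_i = \min_{g\in\Sigma^\eta_\Lambda}\mathrm{E}_{\pi_i} g$,
$i = 1,2,\ldots$. Assume that $\pi_i \to \pi$ and $\gamma_i \to \gamma$ as
$i \to \infty$. Then $\gamma \in \mathcal{M}\Sigma^\eta_\Lambda$ and
$\mathrm{E}_\pi\gamma = \min_{g\in\Sigma^\eta_\Lambda}\mathrm{E}_\pi g$.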

In particular, the lemma implies that a continuous proper loss function exists 
in games where each minimum is attained in a unique point. Let us formulate 
this assumption explicitly and prove the existence theorem. 

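Assumption 3. For every $\pi \in \mathcal{P}(\Omega)$ such that $\pi(\omega_1) = 0$
and $\pi(\omega_2) = 0$ for some $\omega_1,\omega_2 \in \Omega$,
$\omega_1 \ne \omega_2$, there exists only one point where the minimum of
$\mathrm{E}_\pi g$ over all $g \in \mathcal{M}\Sigma^\eta_\Lambda$ is attained.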

Remark 19. Assumption [3] holds automatically for all binary games. The 
games with differentiable H, such as the general square loss game, satisfy As- 
sumption [3] as well. 

Theorem 20. Suppose that the game $(\Omega,\Lambda)$ satisfies Assumptions 1 and 2,
and Assumption 3 for certain $\eta > 0$. Then there exists a continuous loss
function $\lambda\colon \mathcal{P}(\Omega) \to \mathcal{M}\Sigma^\eta_\Lambda$ that
is proper, and even strictly proper, with respect to $\Sigma^\eta_\Lambda$.

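Proof. Let us show first that the minimum of $\mathrm{E}_\pi g$ over all
$g \in \mathcal{M}\Sigma^\eta_\Lambda$ is attained at one point only for all
$\pi \in \mathcal{P}(\Omega)$. For $\pi \in \mathcal{P}^\circ(\Omega)$, it follows
from Lemma 14. Let $\pi \in \mathcal{P}(\Omega)$ be such that $\pi(\omega_0) = 0$
for some $\omega_0 \in \Omega$ and $\pi(\omega) \ne 0$ for $\omega \ne \omega_0$.
Let $g_1, g_2 \in \mathcal{M}\Sigma^\eta_\Lambda$ be any two points of minimum.
Again by Lemma 14, $g_1(\omega) = g_2(\omega)$ for all $\omega \ne \omega_0$.
Therefore $g_1 \le g_2$ or $g_1 \ge g_2$ (since $g_1(\omega_0)$ and $g_2(\omega_0)$
are comparable, being two reals), and the greater of them cannot belong to
$\mathcal{M}\Sigma^\eta_\Lambda$. Thus, $g_1 = g_2$. Assumption 3 works for all
other $\pi \in \mathcal{P}(\Omega)$.

Let us take $\lambda(\pi,\cdot) = \arg\min_{g\in\mathcal{M}\Sigma^\eta_\Lambda}\mathrm{E}_\pi g$
for all $\pi \in \mathcal{P}(\Omega)$. Clearly, $\lambda$ is proper with respect to
$\Sigma^\eta_\Lambda$ (recall that every point in $\Sigma^\eta_\Lambda$ is minorized
by some point in $\mathcal{M}\Sigma^\eta_\Lambda$). Let us prove continuity. Take
any converging sequence $\pi_i \in \mathcal{P}(\Omega)$, let $\pi$ be its limit, and
consider the corresponding $\lambda(\pi_i,\cdot)$. Lemma 18 implies that all
accumulation points of the set $\{\lambda(\pi_i,\cdot)\}$ are points where
$\min_{g\in\mathcal{M}\Sigma^\eta_\Lambda}\mathrm{E}_\pi g$ is attained, therefore
$\lambda(\pi,\cdot)$ is the only accumulation point and $\lambda(\pi_i,\cdot)$
converges to $\lambda(\pi,\cdot)$. □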

3.7 Defensive Forecasting Revisited 

Let us review the results we obtained so far. Theorems 1 and 5 give us the
same loss bound for a game $(\Omega,\Lambda)$, if the AA is realizable and if $Q$
defined by (7) is a forecast-continuous supermartingale, respectively.

The algorithms are very close in their internal structure. We can say even 
more: with the same parameters and inputs, they give the same predictions, in 
some sense. More precisely, two sets coincide: the set of G A satisfying (|TJ) 
and the set of 7^ G A such that 7jv minorizes A(-7Tjv, ■) for ttn satisfying ©. 

Both algorithms are applicable under almost the same conditions: Theorem[5] 
says that if Q is a forecast-continuous supermartingale then the AA is realizable; 
Theorems [T2] and show the converse for games satisfying Assumptions HH21 

Whereas Assumptions 1 and 2 are standard and natural, and the AA is usually
considered only for the games satisfying these assumptions, Assumption 3 is
new and quite cumbersome. However, it turns out that with the help of a more
complicated version of the DFA we can get rid of Assumption 3 and get a perfect
equivalence between the realizability of the AA and some supermartingale
condition (under the standard Assumptions 1 and 2 only).

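To begin with, let us slightly relax the definitions concerning
supermartingales. We say that a function
$q\colon \mathcal{P}^\circ(\Omega)\times\Omega \to \mathbb{R}$ has the
supermartingale property on $\mathcal{P}^\circ(\Omega)$ if for any
$\pi \in \mathcal{P}^\circ(\Omega)$

$$\mathrm{E}_\pi q(\pi,\cdot) \le 1 .$$

The function $q$ is forecast-continuous on $\mathcal{P}^\circ(\Omega)$ if for every
$\omega \in \Omega$ it is continuous as the function of $\pi$ for all
$\pi \in \mathcal{P}^\circ(\Omega)$.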

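Lemma 21. Let a function $q\colon \mathcal{P}^\circ(\Omega)\times\Omega \to \mathbb{R}$
be non-negative and forecast-continuous on $\mathcal{P}^\circ(\Omega)$. Suppose that
for all $\pi \in \mathcal{P}^\circ(\Omega)$ it holds that

$$\mathrm{E}_\pi q(\pi,\cdot) \le C,$$

where $C \in [0,\infty)$ is some constant. Then there exists a sequence
$\{\pi^{(i)}\}_{i\in\mathbb{N}}$ such that $\pi^{(i)} \in \mathcal{P}^\circ(\Omega)$,
the sequence $\pi^{(i)}$ converges in $\mathcal{P}(\Omega)$, the sequences
$q(\pi^{(i)},\omega)$ converge for every $\omega \in \Omega$, and

$$\forall\omega\in\Omega\quad \lim_{i\to\infty} q(\pi^{(i)},\omega) \le C .$$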

The proof of the lemma is given in Appendix after the proof of Lemma |5J 

Theorem 22. Let the game $(\Omega,\Lambda)$ satisfy Assumptions 1 and 2. The AA is
$(c,\eta)$-realizable for this game if and only if there exists $\lambda$ such that
the functions $q_g$ defined by (13) are forecast-continuous on
$\mathcal{P}^\circ(\Omega)$ and have the supermartingale property on
$\mathcal{P}^\circ(\Omega)$ for all $g \in \Sigma_\Lambda$.

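Proof. The "only if" part easily follows from Lemma 15 combined with (the
proof of) Theorem 13.

The "if" part is analogous to Theorem 9. We need to prove inequality (3)
for any finite $G \subseteq \Sigma_\Lambda$ and for any distribution $p$ on $G$.
Let us consider the function

$$q(\pi,\omega) = \sum_{g\in G} p(g)\,q_g(\pi,\omega),$$

which is non-negative, forecast-continuous on $\mathcal{P}^\circ(\Omega)$, and has
the supermartingale property on $\mathcal{P}^\circ(\Omega)$. By Lemma 21 applied to
this $q$ and $C = 1$, there exist $\pi^{(i)} \in \mathcal{P}^\circ(\Omega)$ such that

$$\forall\omega\in\Omega\quad \lim_{i\to\infty}\sum_{g\in G} p(g)\exp\Bigl(\eta\Bigl(\frac{\lambda(\pi^{(i)},\omega)}{c} - g(\omega)\Bigr)\Bigr) \le 1 .$$

Let $\gamma^{(i)} = \lambda(\pi^{(i)},\cdot) \in \Sigma_\Lambda$. Since
$\Sigma_\Lambda$ is compact (by Assumption 1), the sequence $\gamma^{(i)}$ contains
a convergent subsequence; let $\gamma \in \Sigma_\Lambda$ be its limit. Then
$\sum_{g\in G} p(g)\exp(\eta(\gamma(\omega)/c - g(\omega)))$ is a limit of the
corresponding convergent subsequence of the sequence
$\sum_{g\in G} p(g)\exp(\eta(\gamma^{(i)}(\omega)/c - g(\omega)))$, and for every
$\omega \in \Omega$ we get inequality (3):

$$\sum_{g\in G} p(g)\exp\Bigl(\eta\Bigl(\frac{\gamma(\omega)}{c} - g(\omega)\Bigr)\Bigr) \le 1 .$$

□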
Let us state also the algorithm DFA*, a variant of the DFA suitable for
supermartingales on $\mathcal{P}^\circ(\Omega)$. At step $N$, the DFA* defines the function

$$\Bigl(\;\ldots\;\frac{\lambda(\pi,\omega)}{c}\;\ldots\;\Bigr)$$

Then the algorithm chooses as $\gamma$ the limit of any convergent subsequence of
the sequence $\lambda(\pi^{(i)},\cdot)$, and announces $\gamma_N = \sigma(\gamma)$
as Learner's prediction. It is clear that the DFA* guarantees the same loss
bound as Theorem 5.

It is important for applications that the AA is rather efficient
computationally (though it is more complicated than some other algorithms). The
DFA* is designed to obtain a nice theory, and it makes little sense to discuss
its efficiency. The DFA is much more practical than the DFA*. Unfortunately,
the DFA seems to be less practical than the AA. Its main step, hidden in the
proof of Lemma 8, requires finding a fixed point (or a minimax), which is
generally a hard task (PPAD-complete). For binary games, however, the fixed
points can be found by the bisection method, which gives us a reasonably
efficient implementation of the DFA. Some tricks can also help for games with
three outcomes.
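The bisection idea for binary games can be sketched as follows. This is an illustrative implementation under explicit assumptions, not the authors' code: the game is the binary square loss game with $\lambda(p,\omega) = (p-\omega)^2$, $c = 1$ and $\eta = 2$ (for which the supermartingale property is assumed to hold), experts are represented by their predictions in $[0,1]$ with current weights summing to 1, and the function names are hypothetical. The search looks for $p$ with $q(p,0) = q(p,1)$; at such a point both values equal $\mathrm{E}_p q(p,\cdot) \le 1$.

```python
import math

def q(p, omega, experts, weights, eta=2.0):
    """Weighted sum over experts g of exp(eta * ((p - omega)^2 - (g - omega)^2))."""
    return sum(w * math.exp(eta * ((p - omega) ** 2 - (g - omega) ** 2))
               for g, w in zip(experts, weights))

def defensive_forecast(experts, weights, eta=2.0, iterations=60):
    """Find p in [0, 1] with q(p, 0) <= 1 and q(p, 1) <= 1 by bisection."""
    def gap(p):
        return q(p, 1, experts, weights, eta) - q(p, 0, experts, weights, eta)

    # Endpoint cases: the expectation at p = 0 (or p = 1) is q(0, 0) (or q(1, 1)).
    if gap(0.0) <= 0.0:
        return 0.0
    if gap(1.0) >= 0.0:
        return 1.0
    low, high = 0.0, 1.0          # invariant: gap(low) > 0 > gap(high)
    for _ in range(iterations):
        mid = 0.5 * (low + high)
        if gap(mid) > 0.0:
            low = mid
        else:
            high = mid
    return 0.5 * (low + high)

experts = [0.1, 0.6, 0.9]         # hypothetical expert predictions
weights = [0.5, 0.3, 0.2]         # hypothetical current weights
p = defensive_forecast(experts, weights)
print(p, q(p, 0, experts, weights), q(p, 1, experts, weights))
```

A fixed number of halvings gives the root only approximately, so in practice the two values of $q$ agree up to a small tolerance.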

Remark 23. After this paper was finished, the authors discovered another way
to deal with games that do not satisfy Assumption 3. The idea is to consider a
multivalued loss function: to every $\pi$ it assigns all points where the
minimum of $\mathrm{E}_\pi g$ is attained. The definition of supermartingale
should be modified accordingly, and a variant of Lemma 21 can be proved for such
multivalued supermartingales. The details will be added later or published
elsewhere.

3.8 On Continuous Outcomes 

We assumed so far that the space of outcomes, $\Omega$, is finite. However, it is
often natural to consider a continuous space of outcomes. For example, for the
square loss function $\lambda^{sq}(p,\omega) = (p-\omega)^2$, one can take
$\omega \in [0,1]$ instead of $\omega \in \{0,1\}$.

In this subsection we consider one important case of continuous outcome
spaces: a finite-dimensional simplex. We will consider a simplex as the space
$\mathcal{P}(\Omega)$ of distributions on some finite $\Omega$. A game of
prediction is a pair $(\mathcal{P}(\Omega),\Lambda)$, where
$\Lambda \subseteq [0,\infty]^{\mathcal{P}(\Omega)}$; predictions are functions
$\gamma\colon \mathcal{P}(\Omega) \to [0,\infty]$; the protocol is the same. Each
game of prediction with the outcomes from a simplex $\mathcal{P}(\Omega)$ can be
restricted to a game on $\Omega$: we identify each $\omega \in \Omega$ with the
distribution $\delta_\omega$ concentrated on this $\omega$. Thus we may assume
$\mathcal{P}(\Omega) \supseteq \Omega$. Denote by
$\Lambda_\Omega \subseteq [0,\infty]^\Omega$ the set of functions from $\Lambda$
restricted to $\Omega$.

We will show how the supermartingale technique works for games having
some regularity property. (A similar extension for the AA is discussed in
[14, Section 4.1].)
To motivate this kind of property, let us start from the other side and assume
that we have a prediction $\gamma$ (recall that our prediction is a vector of our
losses for every possible outcome) defined on $\Omega$ and want to extend it to
$\mathcal{P}(\Omega)$. The most natural way to do this is to say that an element
of $\mathcal{P}(\Omega)$ is just a probability distribution on the outcomes, and
consider the expected loss with respect to this distribution, that is,
$\gamma(p) := \mathrm{E}_p\gamma$ for every $p \in \mathcal{P}(\Omega)$. It is also
natural to expect that having this property one should be able to transfer a
regret bound from the game on $\Omega$ to the respective game on
$\mathcal{P}(\Omega)$. However, the equality $\gamma(p) = \mathrm{E}_p\gamma$ is
too restrictive. For example, it does not hold for the square loss. At the same
time, what does hold for the square loss (and will be checked later) is an
equality concerning the difference of two predictions:
$\gamma_1(p) - \gamma_2(p) = \mathrm{E}_p(\gamma_1 - \gamma_2)$. This is quite
natural in our context, since the difference is a regret, loosely speaking, and
a regret is the value we are optimizing. This leads to the following requirement
(formally weaker than the condition for the square loss).

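We say that $\Lambda \subseteq [0,\infty]^{\mathcal{P}(\Omega)}$ has the relative
exp-convexity property for certain $c$ and $\eta$ if for all
$\gamma_1,\gamma_2 \in \Lambda$ and for all $p \in \mathcal{P}(\Omega)$ it holds that

$$\exp\Bigl(\eta\Bigl(\frac{\gamma_1(p)}{c} - \gamma_2(p)\Bigr)\Bigr) \le \sum_{\omega\in\Omega} p(\omega)\exp\Bigl(\eta\Bigl(\frac{\gamma_1(\omega)}{c} - \gamma_2(\omega)\Bigr)\Bigr).$$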

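Remark 24. The relative exp-convexity property for any $c > 0$ and $\eta$ follows from

$$\forall\gamma\in\Lambda\ \forall p\in\mathcal{P}(\Omega)\quad \gamma(p) = \sum_{\omega\in\Omega} p(\omega)\gamma(\omega)$$

due to convexity of the exponent function. For $c = 1$ and any $\eta$, it follows
also from

$$\forall\gamma_1,\gamma_2\in\Lambda\ \forall p\in\mathcal{P}(\Omega)\quad \gamma_1(p) - \gamma_2(p) = \sum_{\omega\in\Omega} p(\omega)\bigl(\gamma_1(\omega) - \gamma_2(\omega)\bigr).$$

Let $\sigma_\Omega\colon \Lambda_\Omega \to \Lambda$ be any mapping inverse to the
restriction from $\Lambda$ to $\Lambda_\Omega$, that is, for any
$\gamma \in \Lambda_\Omega$, the function $\sigma_\Omega(\gamma) \in \Lambda$
restricted to $\Omega$ is $\gamma$. Such a mapping exists since every element of
$\Lambda_\Omega$ is a restriction of some element of $\Lambda$.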

Theorem 25. For a game $(\mathcal{P}(\Omega),\Lambda)$, suppose that $\Lambda$ has
the relative exp-convexity property for some $c \ge 1$ and $\eta > 0$. For the
restricted game $(\Omega,\Lambda_\Omega)$, suppose that for some
$\lambda\colon \mathcal{P}(\Omega) \to \Sigma_{\Lambda_\Omega}$, the functions $q_g$
defined by (13) are forecast-continuous and have the supermartingale property for
all $g \in \Sigma_{\Lambda_\Omega}$. Let
$\sigma\colon \Sigma_{\Lambda_\Omega} \to \Lambda_\Omega$ be a substitution function
(that is, $\sigma(g) \le g$ for all $g \in \Sigma_{\Lambda_\Omega}$). Then for the
game $(\mathcal{P}(\Omega),\Lambda)$ there is Learner's strategy (in fact, a variant
of the DFA) with parameters $c$, $\eta$, $\lambda$, $P_0$, $\sigma$, and
$\sigma_\Omega$ guaranteeing that, at each step $N$ and for all experts $\theta$,
it holds

$$L_N \le cL_N^\theta + \frac{c}{\eta}\ln\frac{1}{P_0(\theta)} .$$



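Proof. Assume that we are at step $N$ and need to announce the next prediction.
Let $\gamma_n^\theta \in \Lambda$, $n = 1,\ldots,N$, be the experts' predictions up
to step $N$, $\gamma_n$, $n = 1,\ldots,N-1$, be the Learner's previous predictions,
and $p_n$, $n = 1,\ldots,N-1$, be the previous outcomes. Define the function
$Q_{N-1}$ from $\Theta$ to $\mathbb{R}$:

$$Q_{N-1}(\theta) = \prod_{n=1}^{N-1}\exp\Bigl(\eta\Bigl(\frac{\gamma_n(p_n)}{c} - \gamma_n^\theta(p_n)\Bigr)\Bigr)$$

and consider the following function on $\mathcal{P}(\Omega)\times\Omega$:

$$q_N(\pi,\omega) = \sum_{\theta\in\Theta} P_0(\theta)Q_{N-1}(\theta)\times\exp\Bigl(\eta\Bigl(\frac{\lambda(\pi,\omega)}{c} - \gamma_N^\theta(\omega)\Bigr)\Bigr).$$

Due to the assumptions about the last multiplier, $q_N$ is forecast-continuous and
$\mathrm{E}_\pi q_N(\pi,\cdot) \le \sum_{\theta\in\Theta} P_0(\theta)Q_{N-1}(\theta)$.
By Lemma 8 we can find $\pi_N \in \mathcal{P}(\Omega)$ such that for all
$\omega \in \Omega$

$$q_N(\pi_N,\omega) \le \sum_{\theta\in\Theta} P_0(\theta)Q_{N-1}(\theta) .$$

The prediction of the strategy is
$\gamma_N = \sigma_\Omega(\sigma(\lambda(\pi_N,\cdot))) \in \Lambda$.

Let $p_N \in \mathcal{P}(\Omega)$ be the outcome at step $N$. The relative
exp-convexity property implies that

$$\exp\Bigl(\eta\Bigl(\frac{\gamma_N(p_N)}{c} - \gamma_N^\theta(p_N)\Bigr)\Bigr) \le \sum_{\omega\in\Omega} p_N(\omega)\exp\Bigl(\eta\Bigl(\frac{\gamma_N(\omega)}{c} - \gamma_N^\theta(\omega)\Bigr)\Bigr).$$

We have $\gamma_N(\omega) = \sigma(\lambda(\pi_N,\cdot))(\omega)$ by the definition
of $\sigma_\Omega$, hence we have $\gamma_N(\omega) \le \lambda(\pi_N,\omega)$ by
definition of $\sigma$. Thus,

$$\sum_{\theta\in\Theta} P_0(\theta)Q_N(\theta) = \sum_{\theta\in\Theta} P_0(\theta)Q_{N-1}(\theta)\times\exp\Bigl(\eta\Bigl(\frac{\gamma_N(p_N)}{c} - \gamma_N^\theta(p_N)\Bigr)\Bigr)$$
$$\le \sum_{\theta\in\Theta} P_0(\theta)Q_{N-1}(\theta)\times\sum_{\omega\in\Omega} p_N(\omega)\exp\Bigl(\eta\Bigl(\frac{\lambda(\pi_N,\omega)}{c} - \gamma_N^\theta(\omega)\Bigr)\Bigr)$$
$$= \sum_{\omega\in\Omega} p_N(\omega)\,q_N(\pi_N,\omega) \le \sum_{\theta\in\Theta} P_0(\theta)Q_{N-1}(\theta),$$

and the loss bound follows as usual. □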

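As an example, let us again consider the Brier game (the general square
loss function), now with distributions as outcomes: $\Omega$ is a finite non-empty
set, outcomes $p$ are from $\mathcal{P}(\Omega)$, and the loss of decision
$\pi \in \mathcal{P}(\Omega)$ for outcome $p$ is

$$\lambda^B(\pi,p) = \sum_{\omega\in\Omega}\bigl(p(\omega) - \pi(\omega)\bigr)^2 .$$

It is easy to check that this game has the relative exp-convexity property for
$c = 1$ and any $\eta$ due to Remark 24:

$$\sum_{\omega\in\Omega} p(\omega)\bigl(\lambda^B(\pi_1,\omega) - \lambda^B(\pi_2,\omega)\bigr)
= \sum_{\omega\in\Omega}\bigl(\pi_1(\omega)^2 - \pi_2(\omega)^2\bigr) + 2\sum_{\omega\in\Omega} p(\omega)\bigl(\pi_2(\omega) - \pi_1(\omega)\bigr)
= \lambda^B(\pi_1,p) - \lambda^B(\pi_2,p) .$$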

Another important example is the Kullback-Leibler game (its restricted
version is the logarithmic loss game):

$$\lambda^{KL}(\pi,p) = \sum_{\omega\in\Omega} p(\omega)\ln\frac{p(\omega)}{\pi(\omega)} .$$

This game also has the relative exp-convexity property for $c = 1$ and any $\eta$:

$$\lambda^{KL}(\pi_1,p) - \lambda^{KL}(\pi_2,p) = \sum_{\omega\in\Omega} p(\omega)\bigl(\lambda^{KL}(\pi_1,\omega) - \lambda^{KL}(\pi_2,\omega)\bigr).$$
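Both difference identities are easy to confirm numerically. The sketch below is an illustration with assumed names and randomly drawn distributions on three outcomes; an outcome $\omega \in \Omega$ is represented by the vertex $\delta_\omega$ of the simplex, as in the identification described earlier in this subsection.

```python
import numpy as np

def brier(pred, outcome):
    """Brier loss with a distribution as outcome: sum_w (outcome(w) - pred(w))^2."""
    return float(np.sum((outcome - pred) ** 2))

def kl(pred, outcome):
    """Kullback-Leibler loss sum_w outcome(w) ln(outcome(w) / pred(w)), with 0 ln 0 = 0."""
    support = outcome > 0
    return float(np.sum(outcome[support] * np.log(outcome[support] / pred[support])))

def vertex(n_outcomes, w):
    """The distribution delta_w concentrated on outcome w."""
    delta = np.zeros(n_outcomes)
    delta[w] = 1.0
    return delta

rng = np.random.default_rng(3)
max_gap = 0.0
for _ in range(200):
    p, pi1, pi2 = rng.dirichlet(np.ones(3), size=3)
    for loss in (brier, kl):
        lhs = loss(pi1, p) - loss(pi2, p)
        rhs = sum(p[w] * (loss(pi1, vertex(3, w)) - loss(pi2, vertex(3, w)))
                  for w in range(3))
        max_gap = max(max_gap, abs(lhs - rhs))

print("largest deviation from the identity:", max_gap)
```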
4 Second-Guessing Experts



In this section, we apply the supermartingale technique and the DFA to a new
variant of the prediction with expert advice setting. Protocol 2 is an extension
of Protocol 1 where the game is specified by the same elements $(\Omega,\Lambda)$
as before, but the Experts have a new power.

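Protocol 2 Prediction with Second-Guessing Expert Advice
  $L_0 := 0$.
  $L_0^\theta := 0$, for all $\theta \in \Theta$.
  for $n = 1, 2, \ldots$ do
    All Experts $\theta \in \Theta$ announce $\Gamma_n^\theta\colon \Lambda \to \Lambda$.
    Learner announces $\gamma_n \in \Lambda$.
    Reality announces $\omega_n \in \Omega$.
    $L_n := L_{n-1} + \gamma_n(\omega_n)$.
    $L_n^\theta := L_{n-1}^\theta + \Gamma_n^\theta(\gamma_n,\omega_n)$, for all $\theta \in \Theta$.
  end for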



The new protocol contains only one substantial change. Every Expert $\theta$
announces a function $\Gamma^\theta$ from $\Lambda$ to $\Lambda$ instead of an
element of $\Lambda$ (to simplify notation, we consider $\Gamma$ also as a function
from $\Lambda\times\Omega$ to $[0,\infty]$, as we did with the proper loss
functions $\lambda$). Informally speaking, now an expert's opinion is not a
prediction, but a conditional statement that specifies the actual prediction
depending on Learner's next step. Therefore, the loss of each expert is
determined by the Learner's prediction as well as by the outcome chosen by
Reality. We will call the experts in Protocol 2 second-guessing experts.
Second-guessing experts are a generalization of experts in the standard
Protocol 1: a standard expert can be interpreted in Protocol 2 as a constant
function.

The phenomenon of "second-guessing experts" occurs, for example, in real- 
world finance. In particular, commercial banks serve as "second-guessing ex- 
perts" for the central bank when they use variable interest rates (that is, the 
interest rate for the next period is announced not as a fixed value but as an 
explicit function of the central bank base rate). 

In game theory, the notion of internal regret [HI 121 H21 122] is somewhat 
related to the idea of second-guessing experts. The internal regret appears in the 
framework where for each prediction, which is called action in that context, there 
is an expert that consistently recommends this action, and Learner follows one 
of the experts at each step. The internal regret for a pair of experts shows 
by how much Learner could have decreased his loss if he had followed expert j 
each time he followed expert i. This can be modeled by a second-guessing expert 
that "adjusts" Learner's predictions: agrees with Learner if Learner does not 
follow i, and recommends following j when the Learner follows i. 

The internal regret is usually studied in randomized prediction protocols.
In the case of deterministic Learner's predictions, one cannot hope to get any
interesting loss bound without additional assumptions. Indeed, Experts can
always suggest exactly the "opposite" to the Learner's prediction (for example,
in the log loss game, they predict 1 if Learner predicts $p_n$ ("the probability
of 1") less than 0.5 and they predict 0 otherwise), and Reality can "agree" with
them (choosing the outcome equal to Experts' prediction); then the Experts'
losses remain zero, but the Learner's loss grows linearly in the number of steps.
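The "opposite" experts argument can be simulated directly. The sketch below is a minimal illustration for the binary log loss game; the Learner strategy (a Laplace-rule forecaster), the function names, and the number of steps are assumptions of this example. Whatever deterministic strategy is plugged in, each step costs Learner at least $\ln 2$ while the second-guessing expert suffers zero loss.

```python
import math

def opposite_expert(p):
    """Second-guessing expert: given Learner's probability of 1, predict the opposite extreme."""
    return 1.0 if p < 0.5 else 0.0

def log_loss(p, outcome):
    """Log loss of predicting probability p for outcome 1, on a binary outcome."""
    prob_of_outcome = p if outcome == 1 else 1.0 - p
    return math.inf if prob_of_outcome == 0.0 else -math.log(prob_of_outcome)

def laplace_learner(history):
    """A hypothetical deterministic Learner: Laplace's rule of succession."""
    return (sum(history) + 1.0) / (len(history) + 2.0)

def play(learner, steps):
    """Run the protocol with one opposite expert and a Reality that agrees with the expert."""
    learner_loss = expert_loss = 0.0
    history = []
    for _ in range(steps):
        p = learner(history)               # Learner's prediction
        advice = opposite_expert(p)        # expert's prediction, a function of p
        outcome = int(advice)              # Reality picks the outcome the expert predicted
        learner_loss += log_loss(p, outcome)
        expert_loss += log_loss(advice, outcome)
        history.append(outcome)
    return learner_loss, expert_loss

steps = 100
learner_loss, expert_loss = play(laplace_learner, steps)
print(learner_loss, expert_loss, steps * math.log(2.0))
```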
A non-trivial bound is possible if Learner is allowed to give predictions in the
form of a distribution on Experts. This can be formalized as the Freund-Schapire
game [25, Example 7]. Then the second-guessing expert modeling an internal
regret is a continuous transformation of the distribution given by Learner. The
results of [3] and others are bounds of the form
$L_N \le L_N^\theta + O(\sqrt{N})$ for the Freund-Schapire game, which is
non-mixable. A discussion of bounds of this form achievable by the defensive
forecasting method will be published elsewhere; in this paper we consider
another kind of bounds. However, here we will also make the assumption that
second-guessing experts modify the prediction of Learner continuously.



4.1 The DFA for Second-Guessing Experts 

First consider the case when $\Gamma_n^\theta$ are continuous mappings from
$\Lambda$ to $\Lambda$. The DFA requires virtually no modifications for this task
and gives the same loss bounds as in Theorem 5.

Theorem 26. Suppose that for some $c$, $\eta$, and some continuous
$\lambda\colon \mathcal{P}(\Omega) \to \Lambda$ the functions $q_g$ defined by (13)
are forecast-continuous and have the supermartingale property for all
$g \in \Lambda$. Then for the game following the protocol of prediction with
second-guessing expert advice where all experts $\theta$ at all steps $n$
announce continuous functions $\Gamma_n^\theta\colon \Lambda \to \Lambda$, there is
Learner's strategy (in fact, the DFA applied to $Q^{P_0}$ defined by (17)) with
parameters $c$, $\eta$, $\lambda$, $P_0$ (where $P_0$ is a distribution on
$\Theta$) guaranteeing that, at each step $N$ and for all experts $\theta$,
it holds

$$L_N \le cL_N^\theta + \frac{c}{\eta}\ln\frac{1}{P_0(\theta)} .$$

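Proof. For any continuous $\Gamma\colon \Lambda \to \Lambda$ consider the function

$$q_\Gamma(\pi,\omega) = \exp\Bigl(\eta\Bigl(\frac{\lambda(\pi,\omega)}{c} - \Gamma(\lambda(\pi,\cdot),\omega)\Bigr)\Bigr).$$

It is forecast-continuous as a composition of continuous functions, and has the
supermartingale property since for any $\pi \in \mathcal{P}(\Omega)$, taking
$g = \Gamma(\lambda(\pi,\cdot))$, we have
$\mathrm{E}_\pi q_\Gamma(\pi,\cdot) = \mathrm{E}_\pi q_g(\pi,\cdot) \le 1$. Similarly
to (7), define $Q^{P_0}$ on
$((C(\Lambda\to\Lambda))^\Theta\times\mathcal{P}(\Omega)\times\Omega)^*$, where
$C(\Lambda\to\Lambda)$ is the set of continuous functions on $\Lambda$, by the
formula

$$Q^{P_0}\bigl(\{\Gamma_1^\theta\}_{\theta\in\Theta},\pi_1,\omega_1,\ldots,\{\Gamma_N^\theta\}_{\theta\in\Theta},\pi_N,\omega_N\bigr)
= \sum_{\theta\in\Theta} P_0(\theta)\prod_{n=1}^{N}\exp\Bigl(\eta\Bigl(\frac{\lambda(\pi_n,\omega_n)}{c} - \Gamma_n^\theta(\lambda(\pi_n,\cdot),\omega_n)\Bigr)\Bigr). \qquad (17)$$

As in Theorem 5, $Q^{P_0}$ is a forecast-continuous supermartingale.

At step $N$, the strategy chooses any $\pi_N$ for which $Q^{P_0}$ does not
increase whatever outcome $\omega_N$ is chosen (as in the DFA) and announces
$\gamma_N = \lambda(\pi_N,\cdot)$ as Learner's prediction (we do not need a
substitution function here since the range of $\lambda$ is in $\Lambda$ by the
theorem assumption). The loss bound follows, since

$$\exp\Bigl(\eta\sum_{n=1}^{N}\Bigl(\frac{\gamma_n(\omega_n)}{c} - \Gamma_n^\theta(\gamma_n,\omega_n)\Bigr)\Bigr)
= \prod_{n=1}^{N}\exp\Bigl(\eta\Bigl(\frac{\lambda(\pi_n,\omega_n)}{c} - \Gamma_n^\theta(\lambda(\pi_n,\cdot),\omega_n)\Bigr)\Bigr) \le \frac{1}{P_0(\theta)} .$$

□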

Recall that Theorem [20] provides us (under Assumptions [1]-[3]) with a
continuous proper loss function $\lambda\colon \mathcal{P}(\Omega) \to \mathcal{M}\Sigma_\Lambda^\eta$. For any $\eta$-mixable game, we
have $\mathcal{M}\Sigma_\Lambda^\eta \subseteq \Lambda$, and due to Theorem [13] we can take this $\lambda$ and get
forecast-continuous $q_g$ with the supermartingale property.

For non-mixable games there is no guarantee that such $\lambda$ exists. Theorem IT51
gives a function $\lambda$ ranging over $\partial\Sigma_\Lambda$ (the boundary of the superprediction
set $\Sigma_\Lambda$), which is not necessarily contained in $\Lambda$. Moreover, it may happen that
even for continuous experts $\Gamma_n^\theta\colon \Lambda \to \Lambda$ it is impossible to get any interesting
loss bound, for any strategy. Indeed, consider a game where $\Lambda$ is not connected
(e.g., the simple prediction game [23, Example 1] with $\Lambda = \{(0,1),(1,0)\}$).
Then the example with "opposite" predictions works: the experts just need to
map Learner's predictions into another connected component.
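
The failure for a disconnected prediction set can be simulated directly. The
sketch below is only an illustration under stated assumptions: it takes the
simple prediction game with loss 1 for a wrong guess and 0 for a correct one,
an expert that always maps Learner's guess to the opposite guess, and an
adversarial Reality that always picks the outcome Learner did not predict.
The names `simulate_opposite_expert` and `learner_strategy` are hypothetical.

```python
def simulate_opposite_expert(learner_strategy, n_steps):
    """Play the simple prediction game against an 'opposite' second-guessing expert.

    learner_strategy maps the list of past outcomes to a guess in {0, 1}.
    Returns (Learner's cumulative loss, Expert's cumulative loss).
    """
    learner_loss = 0
    expert_loss = 0
    history = []
    for _ in range(n_steps):
        guess = learner_strategy(history)
        expert_guess = 1 - guess      # the expert jumps to the other component
        outcome = 1 - guess           # assumed adversarial Reality
        learner_loss += int(guess != outcome)
        expert_loss += int(expert_guess != outcome)
        history.append(outcome)
    return learner_loss, expert_loss


# Two hypothetical Learner strategies: always guess 0, or copy the last outcome.
always_zero = lambda history: 0
copy_last = lambda history: history[-1] if history else 0
```

Whatever deterministic strategy Learner uses in this sketch, Learner's loss
grows linearly while the expert's loss stays at zero, so no bound of the form
considered here can hold.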

For this reason, let us consider a modification of the second-guessing protocol
that changes the sets of predictions allowed for Learner and for Experts. Namely,
for the game $(\Omega,\Lambda)$, Experts $\theta \in \Theta$ announce $\Gamma_n^\theta\colon \partial\Sigma_\Lambda \to \Sigma_\Lambda$, and Learner
announces $\gamma_n \in \partial\Sigma_\Lambda$ (the rest of the protocol does not change). We will assume
that the game satisfies Assumptions [1] and [2] (for non-compact $\Lambda$ the boundary
$\partial\Sigma_\Lambda$ may be empty). Then the modified protocol usually gives more freedom to
Learner: since $\mathcal{M}\Lambda \subseteq \partial\Sigma_\Lambda$, the predictions in $\Lambda \setminus \partial\Sigma_\Lambda$ are minorized by some
better predictions in $\mathcal{M}\Lambda$. The Experts are allowed to give predictions (which
are $\Gamma_n^\theta(\gamma_n)$) in a larger set $\Sigma_\Lambda$; however, they need to cope with Learner's
predictions from a larger set too.

For the modified protocol, Theorem [26] holds with minimal changes: $\lambda$ is
allowed to range over $\Sigma_\Lambda$ instead of $\Lambda$, the functions $q_g$ have the supermartingale
property for all $g \in \Sigma_\Lambda$ (instead of $g \in \Lambda$ only), and $\Gamma_n^\theta$ are continuous functions
from $\partial\Sigma_\Lambda$ to $\Sigma_\Lambda$; the proof does not change. Theorem [13] provides us with $\lambda$
such that $q_g$ have the required properties.




4.2 The AA for Second-Guessing Experts 

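In contrast to the DFA, the AA cannot be applied to the second-guessing
protocol in a straightforward way. However, the AA can be modified for this case.
Recall that the AA is based on the inequality (HJ, which is already solved for $\gamma_N$;
in the second-guessing protocol, both sides of this inequality will contain $\gamma_N$:

\[ \gamma_N(\omega_N) \le -\frac{1}{\eta} \ln \left( \sum_{\theta\in\Theta} p_{N-1}(\theta) \exp\bigl(-\eta\, \Gamma_N^\theta(\gamma_N,\omega_N)\bigr) \right) , \]

where $p_{N-1}(\theta)$ are the normalized weights of the experts after step $N-1$.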

The DFA implicitly solves this inequality in (the proof of) Lemma [4] using a
kind of fixed point theorem. We will present a modification of the AA which
uses a fixed point theorem explicitly.

A topological space $X$ has the fixed point property if every continuous
function $f\colon X \to X$ has a fixed point, that is, $\exists x \in X\ f(x) = x$.

Let us show that if the game $(\Omega,\Lambda)$ satisfies Assumptions [1] and [2] then the
set $\mathcal{M}\Sigma_\Lambda^\eta$ (the set of minimal points of $\Sigma_\Lambda^\eta$) has the fixed point property for any
$\eta > 0$. First consider the homeomorphism from $[0,\infty]^\Omega$ to $[0,1]^\Omega$ that maps
$g \mapsto \exp(-\eta g)$. As mentioned in Section [2], the set $\exp(-\eta\Sigma_\Lambda^\eta)$ is convex. It
is non-empty due to Assumption [2] and compact due to Assumption [1]. Thus,
$\exp(-\eta\Sigma_\Lambda^\eta)$ has the fixed point property by [1, Theorem 4.10], and $\Sigma_\Lambda^\eta$ has
the property as its homeomorphic image [1, Theorem 4.1]. Now we need the
following technical lemma, proved in the Appendix.

Lemma 27. There is a continuous mapping $F\colon \Sigma_\Lambda^\eta \to \mathcal{M}\Sigma_\Lambda^\eta$ such that $F(g) \le g$
for any $g \in \Sigma_\Lambda^\eta$.

Remark 28. Essentially, the main content of Lemma [27] is a construction
of a continuous substitution function. In many natural games, the standard
substitution functions are continuous without additional effort.

The definition of $\mathcal{M}\Sigma_\Lambda^\eta$ implies that if $F(g) \le g$ then $F(g) = g$ for any
$g \in \mathcal{M}\Sigma_\Lambda^\eta$, and hence $F$ defined in the lemma is a retraction (by definition, a
continuous mapping from a topological space into its subset that does not move
elements of the subset). Due to [1, Theorem 4.2], since $\Sigma_\Lambda^\eta$ has the fixed point
property, its retract $\mathcal{M}\Sigma_\Lambda^\eta$ has the fixed point property too.

Theorem 29. Suppose that the game $(\Omega,\Lambda)$ satisfies Assumptions [1] and [2] and
is $\eta$-mixable. Then for the prediction with second-guessing expert advice protocol,
there exists Learner's strategy (a modification of the AA) with parameters $\eta$ and
$P_0$ guaranteeing that, at each step $N$ and for all experts $\theta$, it holds

\[ L_N \le L_N^\theta + \frac{1}{\eta} \ln \frac{1}{P_0(\theta)} . \]

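Proof. At step $N$, the modified AA announces as Learner's prediction $\gamma_N$ any
solution of the following equation with respect to $\gamma \in \mathcal{M}\Sigma_\Lambda^\eta$:

\[ \gamma = F\left( -\frac{1}{\eta} \ln \sum_{\theta\in\Theta} \frac{P_{N-1}(\theta)}{\sum_{\theta'\in\Theta} P_{N-1}(\theta')} \exp\bigl(-\eta\, \Gamma_N^\theta(\gamma)\bigr) \right) , \tag{18} \]

where $\Gamma_N^\theta$ are announced by the experts, the weights $P_{N-1}$ are defined in the
usual way with the help of the previous losses:

\[ P_{N-1}(\theta) = P_0(\theta) \prod_{n=1}^{N-1} \exp\bigl(-\eta\, \Gamma_n^\theta(\gamma_n,\omega_n)\bigr) , \]

and $F$ is the continuous mapping from Lemma [27].

Since for an $\eta$-mixable game we have $\Sigma_\Lambda^\eta = \Sigma_\Lambda$, and since $\mathcal{M}\Sigma_\Lambda \subseteq \Lambda$, the
functions $\Gamma_N^\theta$ are defined on $\gamma$. By the definition of $\Sigma_\Lambda^\eta$, the argument of $F$
in equation (18) belongs to $\Sigma_\Lambda^\eta$, and $F$ maps it to $\mathcal{M}\Sigma_\Lambda^\eta$. The mapping is
continuous as the composition of continuous mappings. Therefore, since $\mathcal{M}\Sigma_\Lambda^\eta$
has the fixed point property, equation (18) has a solution.

The property $F(g) \le g$ implies that

\[ \gamma_N(\omega_N) \le -\frac{1}{\eta} \ln \sum_{\theta\in\Theta} \frac{P_{N-1}(\theta)}{\sum_{\theta'\in\Theta} P_{N-1}(\theta')} \exp\bigl(-\eta\, \Gamma_N^\theta(\gamma_N,\omega_N)\bigr) , \]

and the usual analysis of the AA gives us the bound. □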
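
For intuition, the fixed-point step can be made concrete in one very special
case. The sketch below is not the general construction: it assumes the binary
logarithmic loss game with $\eta = 1$, where a prediction is a probability $p$ of
outcome 1 and the AA mixture of predictions $p^\theta$ is simply their weighted
average, so the fixed-point equation reduces to $p = \sum_\theta w_\theta f_\theta(p)$ for the
experts' continuous maps $f_\theta\colon [0,1] \to [0,1]$. Bisection stands in for the abstract
fixed point theorem; all function names are hypothetical.

```python
def modified_aa_prediction(weights, experts, tol=1e-12):
    """Solve p = sum_theta w_theta * f_theta(p) on [0, 1] by bisection.

    weights: non-negative, not necessarily normalized expert weights.
    experts: continuous functions f_theta mapping Learner's p to the expert's p.
    """
    total = sum(weights)

    def mix(p):
        return sum(w * f(p) for w, f in zip(weights, experts)) / total

    lo, hi = 0.0, 1.0
    # mix(p) - p is continuous, >= 0 at p = 0 and <= 0 at p = 1.
    if mix(lo) <= lo:
        return lo
    if mix(hi) >= hi:
        return hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mix(mid) > mid:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def update_weights(weights, experts, p, outcome):
    """Multiply each weight by exp(-loss), i.e. by the probability that the
    expert's second-guessed prediction f_theta(p) assigned to the outcome."""
    return [w * (f(p) if outcome == 1 else 1.0 - f(p))
            for w, f in zip(weights, experts)]


# Hypothetical experts: an "opposite" one, a smoothing one, a constant one.
example_experts = [lambda p: 1.0 - p, lambda p: 0.5 * (p + 0.8), lambda p: 0.3]
```

With equal weights, the three example experts give the linear equation
$3p = 1.7 - 0.5p$, so the returned prediction should be close to $1.7/3.5$.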

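Let us outline briefly how the construction of Theorem [29] can be applied to
non-mixable games under the modified second-guessing protocol (where experts
are defined on $\partial\Sigma_\Lambda$). Let the AA be $(c,\eta)$-realizable. Now we are looking for
$\gamma \in \partial\Sigma_\Lambda$ satisfying the following equation:

\[ \gamma = V\left( F\left( -\frac{1}{\eta} \ln \sum_{\theta\in\Theta} \frac{P_{N-1}(\theta)}{\sum_{\theta'\in\Theta} P_{N-1}(\theta')} \exp\bigl(-\eta\, \Gamma_N^\theta(\gamma)\bigr) \right) \right) , \tag{19} \]

where after $F$ we apply $V$, the mapping defined in the proof of Lemma [12].
Since $V$ is continuous and maps $\Sigma_\Lambda^\eta$ to $\partial\Sigma_\Lambda$, we get a continuous mapping
of $V(\Sigma_\Lambda^\eta) \subseteq \partial\Sigma_\Lambda$ into itself. It remains to show that $V(\Sigma_\Lambda^\eta)$ has the fixed
point property. Similarly to the proof of Lemma [12], consider the set $Z =
\{ g \in \Sigma_\Lambda^\eta \mid \forall r \in [0,1)\ rg \notin \Sigma_\Lambda^\eta \}$. For any $g \in \Sigma_\Lambda^\eta$, there exists a unique $r$ such
that $rg \in Z$, and the continuity of this mapping $g \mapsto rg$ follows in the same way
as in the proof of Lemma [12]; thus $Z$ is a retract of $\Sigma_\Lambda^\eta$ and has the fixed point
property. Since $V(g) = V(rg)$ for any non-negative real $r$ such that $g$ and $rg$
belong to $\Sigma_\Lambda^\eta$, we have $V(\Sigma_\Lambda^\eta) = V(Z)$. The definition of $Z$ implies that $V$ is
bijective on $Z$, and again as in the proof of Lemma [12] one can show that the
inverse mapping $V^{-1}\colon V(Z) \to Z$ is continuous. Therefore $V(Z)$ has the fixed
point property as the homeomorphic image of $Z$.

Let $\gamma_N \in V(\Sigma_\Lambda^\eta)$ be any solution of the equation (19). By the properties of
$F$ and $V$ (namely $F(g) \le g$ and $V(g) = R(g)g$ with $R(g) \le c$), we have

\[ \gamma_N(\omega_N) \le -\frac{c}{\eta} \ln \sum_{\theta\in\Theta} \frac{P_{N-1}(\theta)}{\sum_{\theta'\in\Theta} P_{N-1}(\theta')} \exp\bigl(-\eta\, \Gamma_N^\theta(\gamma_N,\omega_N)\bigr) , \]

and the usual AA bound follows.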

5 Predictions with Respect to Several Loss Functions

In this section, we illustrate the use of the supermartingale technique for another
extension of Protocol [1]: a game with several loss functions (for a more detailed
discussion of this setting see [5]). In contrast to the case of second-guessing
experts, it is not clear yet whether the AA can help in this case.

Up to now a game was $(\Omega,\Lambda)$ where $\Lambda$ was the set of admissible predictions,
common for Learner and Experts. Here we return to the game specification by
a loss function on the decision space $\mathcal{P}(\Omega)$. However, now each Expert $\theta$ has its
own loss function $\lambda^\theta$. So, the game is specified by $(\Omega, \mathcal{P}(\Omega), \{\lambda^\theta\}_{\theta\in\Theta})$, where
$\lambda^\theta\colon \mathcal{P}(\Omega) \times \Omega \to [0,\infty]$ are proper loss functions. The sets of predictions $\Lambda^\theta$
and superpredictions $\Sigma_\Lambda^\theta$ may be different for different experts $\theta$. The game
follows Protocol [3].

Protocol 3 Prediction with Expert Evaluators' Advice

  $L_0^{(\theta)} := 0$, for all $\theta \in \Theta$.
  $L_0^\theta := 0$, for all $\theta \in \Theta$.
  for $n = 1, 2, \ldots$ do
    All Experts $\theta \in \Theta$ announce $\pi_n^\theta \in \mathcal{P}(\Omega)$.
    Learner announces $\pi_n \in \mathcal{P}(\Omega)$.
    Reality announces $\omega_n \in \Omega$.
    $L_n^{(\theta)} := L_{n-1}^{(\theta)} + \lambda^\theta(\pi_n,\omega_n)$, for all $\theta \in \Theta$.
    $L_n^\theta := L_{n-1}^\theta + \lambda^\theta(\pi_n^\theta,\omega_n)$, for all $\theta \in \Theta$.
  end for



There are two changes in Protocol [3] compared to Protocol [1]. The accumulated
loss $L^\theta$ of each Expert $\theta$ is calculated according to his own loss function $\lambda^\theta$.
Learner does not have one accumulated loss anymore, but the losses $L^{(\theta)}$ of
Learner are calculated separately for comparisons with each Expert $\theta$ and
according to the loss function of this Expert.

Now it does not make much sense to speak about the best expert: their
performance is evaluated by different loss functions and thus the losses may
have different scales. What remains meaningful are bounds for every expert $\theta$ of
the form

\[ L_N^{(\theta)} \le c^\theta L_N^\theta + a^\theta , \]

where $c^\theta$ and $a^\theta$ may be different for different experts $\theta \in \Theta$.

Informally speaking, Protocol [3] describes the following situation. We have
some practical task and a number of prediction algorithms (they will be our
Experts). Each of them minimizes some loss, maybe different for different
algorithms. We do not know which algorithm fits our task best. As usual in
practice, we do not have a loss that measures the quality of predictions for our
task; we only know that predictions must be close to the real outcomes. A safe
option in this case would be to predict in such a way that our predictions are
not bad compared to the predictions of any of the algorithms even if the quality is
evaluated by the loss function ascribed to this algorithm.

The DFA can be adapted to Protocol [3] straightforwardly.

Theorem 30. Suppose that for each $\theta \in \Theta$, there exist reals $c^\theta \ge 1$ and $\eta^\theta > 0$
such that the functions

\[ q_g^\theta(\pi,\omega) = \exp\left(\eta^\theta\left(\frac{\lambda^\theta(\pi,\omega)}{c^\theta} - g(\omega)\right)\right) \]

(they are direct analogs of $q_g$ defined by (13)) are forecast-continuous and have
the supermartingale property for all $g \in \Sigma_\Lambda^\theta$. Then for any initial distribution
$P_0 \in \mathcal{P}(\Theta)$ there is Learner's strategy (in fact, the DFA applied to $Q^{P_0}$ defined
by (20)) guaranteeing that, at each step $N$ and for all experts $\theta$, it holds

\[ L_N^{(\theta)} \le c^\theta L_N^\theta + \frac{c^\theta}{\eta^\theta} \ln \frac{1}{P_0(\theta)} . \]

Proof. Similarly to the proofs of Theorems [5] and [26], we can construct the
supermartingale $Q^{P_0}$:

\[ Q^{P_0}(\{\pi_1^\theta\}_{\theta\in\Theta}, \pi_1, \omega_1, \ldots, \{\pi_N^\theta\}_{\theta\in\Theta}, \pi_N, \omega_N)
   = \sum_{\theta\in\Theta} P_0(\theta) \prod_{n=1}^{N} \exp\left(\eta^\theta\left(\frac{\lambda^\theta(\pi_n,\omega_n)}{c^\theta} - \lambda^\theta(\pi_n^\theta,\omega_n)\right)\right) \tag{20} \]

and choose $\pi_N$ satisfying ([9]) with the help of Lemma [4]. The loss bound follows
in the same way as in Theorem [5]. □

Protocol [3] can also handle the following task. We have several experts and
several candidates for the loss function, and a priori some experts may perform
well for two or more of the loss functions. In this case, it is natural to require
that Learner's loss is small with respect to every expert and with respect to every
loss function. A simple trick reduces the task to Protocol [3]: for each original
expert (supplying us with a prediction), we consider several new experts who
announce the same prediction but use different loss functions. If our predictions
are good in the game with these new experts then our predictions are good in
the original game with respect to any of the loss functions.
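
The reduction amounts to taking the product of the set of experts with the set
of candidate loss functions. A minimal sketch follows; the function name
`expand_experts` and the uniform prior over the new experts are assumptions
made for illustration and are not prescribed by the text.

```python
from itertools import product


def expand_experts(experts, losses):
    """Pair every original expert with every candidate loss function.

    Returns the list of new experts (expert, loss) and a uniform initial
    distribution over them (the uniform choice is an assumption).
    """
    pairs = list(product(experts, losses))
    prior = {pair: 1.0 / len(pairs) for pair in pairs}
    return pairs, prior
```

Each new expert `(expert, loss)` announces the prediction of `expert` and is
evaluated by `loss`, so a bound against all new experts covers every
combination at once.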

For example, assume that we want to compete with $K$ experts according
to the logarithmic loss function and the square loss function in the game with
outcomes $\{0,1\}$. Lemmas [S] and [7] imply that the following function is a
forecast-continuous supermartingale:

\[ \frac{1}{2K} \sum_{k=1}^{K} \exp\left( \sum_{n=1}^{N} \bigl( -\ln \pi_n(\omega_n) + \ln \pi_n^k(\omega_n) \bigr) \right)
   + \frac{1}{2K} \sum_{k=1}^{K} \exp\left( 2 \sum_{n=1}^{N} \bigl( (\omega_n - \pi_n(1))^2 - (\omega_n - \pi_n^k(1))^2 \bigr) \right) , \]

where $\pi_n^k$ is the prediction of Expert $k$ and $\pi_n$ is the prediction of Learner.
Choosing $\pi_n$ according to Lemma [4], we can achieve that the regret term with
respect to the logarithmic loss function is bounded by $\ln(2K) \le \ln K + 0.7$, and
the regret with respect to the square loss function is bounded by $0.5 \ln(2K) \le
0.5 \ln K + 0.4$, which is practically the same as the regrets against $K$ experts that are
achievable when we compete with respect to only one of the loss functions.
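
The example can be tried numerically. The following sketch makes several
assumptions beyond the text: predictions are identified with the probability
$p$ of outcome 1, the experts' predictions lie strictly inside $(0,1)$, and the
prediction required by the lemma is found by bisection on the difference of the
supermartingale values for the two outcomes (at a root both values equal their
expectation, which does not exceed the previous value). All names are
hypothetical.

```python
import math
import random


def supermartingale_value(p, omega, w_log, w_sq, expert_preds):
    """Mixed supermartingale after Learner predicts p and Reality picks omega."""
    p_omega = p if omega == 1 else 1.0 - p
    total = 0.0
    for wl, ws, q in zip(w_log, w_sq, expert_preds):
        q_omega = q if omega == 1 else 1.0 - q
        total += wl * q_omega / p_omega                                  # log loss part
        total += ws * math.exp(2.0 * ((omega - p) ** 2 - (omega - q) ** 2))  # square loss part
    return total


def choose_prediction(w_log, w_sq, expert_preds, iters=200):
    """Find p where the values for outcomes 1 and 0 coincide (bisection)."""
    def gap(p):
        return (supermartingale_value(p, 1, w_log, w_sq, expert_preds)
                - supermartingale_value(p, 0, w_log, w_sq, expert_preds))

    lo, hi = 1e-12, 1.0 - 1e-12
    if gap(lo) <= 0.0:
        return lo
    if gap(hi) >= 0.0:
        return hi
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if gap(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def simulate(expert_seqs, outcomes):
    """Return Learner's regrets (log loss, square loss) against the best expert."""
    k_experts = len(expert_seqs)
    learner_log = learner_sq = 0.0
    experts_log = [0.0] * k_experts
    experts_sq = [0.0] * k_experts
    for n, omega in enumerate(outcomes):
        preds = [seq[n] for seq in expert_seqs]
        # Current terms of the supermartingale, one pair per expert.
        w_log = [math.exp(learner_log - experts_log[k]) / (2 * k_experts)
                 for k in range(k_experts)]
        w_sq = [math.exp(2.0 * (learner_sq - experts_sq[k])) / (2 * k_experts)
                for k in range(k_experts)]
        p = choose_prediction(w_log, w_sq, preds)
        learner_log -= math.log(p if omega == 1 else 1.0 - p)
        learner_sq += (omega - p) ** 2
        for k, q in enumerate(preds):
            experts_log[k] -= math.log(q if omega == 1 else 1.0 - q)
            experts_sq[k] += (omega - q) ** 2
    return learner_log - min(experts_log), learner_sq - min(experts_sq)
```

Up to numerical error, the two returned regrets should stay below $\ln(2K)$ and
$0.5 \ln(2K)$ respectively, whatever the outcomes are.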
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Appendix 



Proof of Lemma [4]. Given the function $q$, let us define the following function $\phi$
on $\mathcal{P}(\Omega) \times \mathcal{P}(\Omega)$:

\[ \phi(\pi',\pi) = E_{\pi'} q(\pi,\cdot) . \]

For each fixed $\pi'$, the function $\phi(\pi',\cdot)$ is continuous, since $q$ is continuous.
For each fixed $\pi$, the function $\phi(\cdot,\pi)$ is linear, and thus concave. Note also that
$\mathcal{P}(\Omega)$ is a convex compact set. Therefore, $\phi$ satisfies the conditions of Ky Fan's
minimax theorem (see e.g. [1, Theorem 11.4]), and thus there exists $\tilde\pi \in \mathcal{P}(\Omega)$
such that for any $\pi' \in \mathcal{P}(\Omega)$ it holds that

\[ E_{\pi'} q(\tilde\pi,\cdot) = \phi(\pi',\tilde\pi) \le \sup_{\pi\in\mathcal{P}(\Omega)} \phi(\pi,\pi) = \sup_{\pi\in\mathcal{P}(\Omega)} E_\pi q(\pi,\cdot) \le C . \tag{21} \]

It is easy to see that $\tilde\pi$ has the property that the lemma must guarantee:
$q(\tilde\pi,\omega) \le C$ for all $\omega \in \Omega$. Indeed, if we substitute the distribution $\delta_\omega$ (which is
concentrated on $\omega$) for $\pi'$ in (21), the left-hand side will be just $q(\tilde\pi,\omega)$. □

Lemma [4] is a very important statement in our supermartingale framework,
so let us outline an alternative proof for it (for details see (TU1 Theorem 6] , [TT1
Theorem 16.1] or [30, Theorem 1]). Consider the sets $F_\omega = \{ \pi \mid q(\pi,\omega) \le C \}$.
These sets are closed and for any $\Omega_0 \subseteq \Omega$ the union $\bigcup_{\omega\in\Omega_0} F_\omega$ contains all the
measures concentrated on $\Omega_0$. Then all $F_\omega$ have a non-empty intersection by
Sperner's lemma.

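Lemma 31. Let a function $q\colon \mathcal{P}^\circ(\Omega) \times \Omega \to \mathbb{R}$ be non-negative and
forecast-continuous on $\mathcal{P}^\circ(\Omega)$. Suppose that for any $\pi \in \mathcal{P}^\circ(\Omega)$ it holds that

\[ E_\pi q(\pi,\cdot) \le C , \]

where $C \in [0,\infty)$ is some constant. Then for any $\epsilon > 0$ it holds that

\[ \exists \pi \in \mathcal{P}^\circ(\Omega)\ \forall \omega \in \Omega \quad q(\pi,\omega) \le (1+\epsilon) C . \]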

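Proof. Let $\delta \in (0,1)$ be a constant to be chosen later.

Let $\mathcal{P}^\delta(\Omega) = \{ \pi \in \mathcal{P}(\Omega) \mid \forall \omega \in \Omega\ \pi(\omega) \ge \delta \}$. For sufficiently small $\delta$, this set
is a non-empty convex compact subset of $\mathcal{P}^\circ(\Omega)$. Repeating the construction from
the proof of Lemma [4] and applying Ky Fan's theorem for the function $\phi$ on
$\mathcal{P}^\delta(\Omega)$, we get that there exists $\tilde\pi \in \mathcal{P}^\delta(\Omega)$ such that for any $\pi' \in \mathcal{P}^\delta(\Omega)$ it
holds that $E_{\pi'} q(\tilde\pi,\cdot) \le C$.

For each $\omega_0$, consider the distribution $\pi_{\delta,\omega_0}$ such that $\pi_{\delta,\omega_0}(\omega) = \delta$ for $\omega \ne \omega_0$
and $\pi_{\delta,\omega_0}(\omega_0) = 1 - \delta(|\Omega| - 1)$. Substituting $\pi_{\delta,\omega_0}$ for $\pi'$, we get

\[ (1 - \delta(|\Omega| - 1))\, q(\tilde\pi,\omega_0) + \delta \sum_{\omega \ne \omega_0} q(\tilde\pi,\omega) \le C . \]

Since $q(\tilde\pi,\omega) \ge 0$ ($q$ is non-negative by assumption), the last inequality
implies that $(1 - \delta(|\Omega| - 1))\, q(\tilde\pi,\omega_0) \le C$. It remains to note that we can choose
$\delta$ so small that $1/(1 - \delta(|\Omega| - 1)) \le 1 + \epsilon$. □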
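
One explicit admissible choice of the constant can be written down; the
particular value below is only an example and is not taken from the original
argument. Writing $m = |\Omega|$, the required inequality is equivalent to a simple
bound on $\delta$:

```latex
\[
  \frac{1}{1 - \delta (m - 1)} \le 1 + \epsilon
  \iff
  \delta (m - 1) \le \frac{\epsilon}{1 + \epsilon} .
\]
% Example choice (an assumption for illustration):
\[
  \delta = \frac{\epsilon}{(1 + \epsilon)\, m}
  \quad\Longrightarrow\quad
  \delta (m - 1) = \frac{\epsilon}{1 + \epsilon} \cdot \frac{m - 1}{m}
  < \frac{\epsilon}{1 + \epsilon},
  \qquad
  \delta < \frac{1}{m} .
\]
```

The last inequality $\delta < 1/m$ also ensures that the set of distributions with
all probabilities at least $\delta$ is non-empty.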

Proof of LemmaWTX. According to Lemma [31] we can find $\pi_k \in \mathcal{P}^\circ(\Omega)$ such that

\[ \forall \omega \in \Omega \quad q(\pi_k,\omega) \le \left(1 + \frac{1}{k}\right) C . \]

Since $\mathcal{P}(\Omega)$ is compact, there exists a strictly increasing index sequence $k(j)$,
$j \in \mathbb{N}$, such that the sequence $\pi_{k(j)}$ converges to some $\pi \in \mathcal{P}(\Omega)$.

The points $g_j = q(\pi_{k(j)},\cdot)$ belong to a compact set $[0,2C]^\Omega$. Hence there
exists a strictly increasing index sequence $j(i)$, $i \in \mathbb{N}$, such that the sequence
$g_{j(i)}$ converges to some $g_0$. For every $\omega \in \Omega$, we have $g_j(\omega) = q(\pi_{k(j)},\omega) \le
(1 + 1/k(j)) C$, therefore

\[ g_0(\omega) = \lim_{i\to\infty} g_{j(i)}(\omega) \le \lim_{i\to\infty} \left(1 + \frac{1}{k(j(i))}\right) C = C . \]

It remains to set $\pi^{(i)} = \pi_{k(j(i))}$ and note that $q(\pi^{(i)},\cdot) = g_{j(i)}$. □

Proof of Lemma [12]. Let $0\colon \Omega \to [0,\infty]$ be the constant zero function (that is,
$0(\omega) = 0$ for all $\omega \in \Omega$). If $0 \in \Sigma_\Lambda$ then $0 \in \partial\Sigma_\Lambda$ and we can let $V(g) = 0$ for
any $g$.

Assume that $0 \notin \Sigma_\Lambda$. Let $V(g) = R(g) g$, where $R\colon \Sigma_\Lambda^\eta \to (0,c]$ is defined by
the following rule: $R(g) = \min\{ r \in (0,c] \mid rg \in \Sigma_\Lambda \}$ for any $g \in \Sigma_\Lambda^\eta$.

Since the AA is $(c,\eta)$-realizable, it holds that $c\Sigma_\Lambda^\eta \subseteq \Sigma_\Lambda$, that is, $cg \in \Sigma_\Lambda$ for
any $g \in \Sigma_\Lambda^\eta$. The minimum is attained since $\Sigma_\Lambda$ is compact (by Assumption [1]).
Thus $R(g)$ is well defined. It is obvious from the definition that $V(g) = R(g) g$
belongs to the boundary $\partial\Sigma_\Lambda$ of $\Sigma_\Lambda$ for all $g \in \Sigma_\Lambda^\eta$.

It remains to check that $V(g) = R(g) g$ is continuous in $g$. We prove that $R$ is
continuous, namely, we take any $g_i \to g_0$ and for any infinite subsequence $\{g_{i_k}\}$,
we show that if $R(g_{i_k})$ converges then $\lim_k R(g_{i_k}) = R(g_0)$. If $R(g_{i_k})$ converges
then $R(g_{i_k}) g_{i_k}$ converges, and $\lim_k R(g_{i_k}) g_{i_k} = (\lim_k R(g_{i_k})) g_0 \in \Sigma_\Lambda$ since $\Sigma_\Lambda$
is compact. Therefore $R(g_0) \le \lim_k R(g_{i_k})$. For the other inequality, consider
$R(g'|g) = \min\{ r \in (0,\infty) \mid rg' \ge V(g) \}$ for $g, g' \in \Sigma_\Lambda^\eta$ such that if $g(\omega) \ne 0$ for
some $\omega \in \Omega$ then $g'(\omega) \ne 0$ too. Clearly, the function $R(g'|g)$ is continuous in $g'$
for any fixed $g$ (note that $R(g'|g) = \max_{\omega\colon g(\omega) \ne 0} V(g)(\omega)/g'(\omega)$) and $R(g|g) =
R(g)$. Since $rg' \ge V(g) \in \Sigma_\Lambda$ implies $rg' \in \Sigma_\Lambda$, we have $R(g'|g) \ge R(g')$.
In particular, $R(g_{i_k}|g_0) \ge R(g_{i_k})$ (assuming $k$ large enough so that $g_0(\omega) \ne 0$
implies $g_{i_k}(\omega) \ne 0$) and $R(g_0) = \lim_k R(g_{i_k}|g_0) \ge \lim_k R(g_{i_k})$. □

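Proof of Lemma [14]. Assume that $\lambda_1(\pi,\omega_0) \ne \lambda_2(\pi,\omega_0)$ and $\pi(\omega_0) > 0$ for some
$\omega_0 \in \Omega$.

Since $\lambda_1(\pi,\cdot)$ and $\lambda_2(\pi,\cdot)$ belong to $\Sigma_\Lambda^\eta$, the point

\[ g = -\frac{1}{\eta} \ln \frac{e^{-\eta\lambda_1(\pi,\cdot)} + e^{-\eta\lambda_2(\pi,\cdot)}}{2} \]

also belongs to $\Sigma_\Lambda^\eta$ by the definition of $\Sigma_\Lambda^\eta$.

For any reals $x$, $y$, we have $(e^x + e^y)/2 \ge e^{(x+y)/2}$, and the inequality is
strict if $x \ne y$. Therefore, $g(\omega) \le (\lambda_1(\pi,\omega) + \lambda_2(\pi,\omega))/2$ for all $\omega \in \Omega$ and
$g(\omega_0) < (\lambda_1(\pi,\omega_0) + \lambda_2(\pi,\omega_0))/2$. Multiplying these inequalities by $\pi(\omega)$ and
summing over all $\omega \in \Omega$, we get

\[ E_\pi g < \frac{1}{2} \bigl( E_\pi \lambda_1(\pi,\cdot) + E_\pi \lambda_2(\pi,\cdot) \bigr) \]

(recall that $\pi(\omega_0) > 0$). Since $\lambda_1$ and $\lambda_2$ are proper with respect to $\Sigma_\Lambda^\eta$, we
have $E_\pi \lambda_1(\pi,\cdot) \le E_\pi g$ and $E_\pi \lambda_2(\pi,\cdot) \le E_\pi g$. Hence we get a contradiction
$E_\pi g < E_\pi g$. □

For a convex function $U\colon \mathbb{R}^\Omega \to [-\infty,\infty]$, a subgradient at point $x \in \mathbb{R}^\Omega$ is
a point $x^* \in \mathbb{R}^\Omega$ such that

\[ \forall z \in \mathbb{R}^\Omega \quad U(z) \ge U(x) + \langle x^*, z - x \rangle . \]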

Lemma 32. Suppose that $Y$ is a non-empty closed convex subset of $[0,\infty)^\Omega$.
Let $U\colon \mathbb{R}^\Omega \to (-\infty,\infty]$ be the function

\[ U(x) = -\inf_{y\in Y} \langle x, y \rangle , \]

where $\langle x, y \rangle = \sum_{\omega\in\Omega} x(\omega) y(\omega)$ is the scalar product in $\mathbb{R}^\Omega$. Then $U(x)$ is a
convex function, and for any $\pi \in \mathcal{P}(\Omega)$, it holds that $U(\pi) < \infty$, and $\pi^*$ is a
subgradient of $U$ at the point $\pi$ if and only if $-\pi^* \in Y$ and $\langle \pi, -\pi^* \rangle = -U(\pi)$.

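Proof. Since $Y$ is not empty, the infimum is less than $+\infty$, and therefore
$U(x) > -\infty$ for all $x$.

For any $a \in [0,1]$ and any $x_1, x_2 \in \mathbb{R}^\Omega$, we have

\[ U(a x_1 + (1-a) x_2) = -\inf_{y\in Y} \bigl( a \langle x_1, y \rangle + (1-a) \langle x_2, y \rangle \bigr)
   \le -\inf_{y\in Y} a \langle x_1, y \rangle - \inf_{y\in Y} (1-a) \langle x_2, y \rangle
   = a U(x_1) + (1-a) U(x_2) , \]

thus $U$ is convex.

Let us fix some $\pi \in \mathcal{P}(\Omega)$. Then $\langle \pi, y \rangle \ge 0$ for all $y \in Y$, and $U(\pi) \le 0 < \infty$.

Let $-\pi^* \in Y$ and $\langle \pi, -\pi^* \rangle = -U(\pi)$. Then $U(\pi) + \langle \pi^*, z - \pi \rangle = -\langle -\pi^*, z \rangle \le
-\inf_{y\in Y} \langle z, y \rangle = U(z)$ for any $z$, thus $\pi^*$ is a subgradient of $U$ at $\pi$.

Let $\pi^*$ be any subgradient of $U$ at $\pi$. Assume that $-\pi^* \notin Y$. Then $-\pi^*$ and
$Y$ can be strongly separated by Corollary 11.4.2 in [T5], and Theorem 11.1(c)
there implies that there exists $z \in \mathbb{R}^\Omega$ such that $\inf_{y\in Y} \langle y, z \rangle > \langle -\pi^*, z \rangle$. Let us
choose $\delta > 0$ such that

\[ \inf_{y\in Y} \langle y, z \rangle > \delta + \langle -\pi^*, z \rangle , \]

and then choose $y_0 \in Y$ such that

\[ \langle \pi + z, y_0 \rangle < \inf_{y\in Y} \langle \pi + z, y \rangle + \delta . \]

From the definition of the subgradient, we get $U(\pi + z) \ge U(\pi) + \langle \pi^*, z \rangle$, and
thus

\[ \langle \pi + z, y_0 \rangle - \delta < \inf_{y\in Y} \langle \pi + z, y \rangle \le \inf_{y\in Y} \langle \pi, y \rangle + \langle -\pi^*, z \rangle \le \langle \pi, y_0 \rangle + \langle -\pi^*, z \rangle . \]

So, $\langle z, y_0 \rangle < \delta + \langle -\pi^*, z \rangle$, which contradicts the choice of $\delta$. This means that
$-\pi^* \in Y$.

It remains to note that the definition of the subgradient implies $U(0) \ge
U(\pi) + \langle \pi^*, 0 - \pi \rangle$, and since $U(0) = 0$, we get $\inf_{y\in Y} \langle y, \pi \rangle = -U(\pi) \ge \langle -\pi^*, \pi \rangle$.
Since $-\pi^* \in Y$, the reverse inequality holds as well, and hence $\langle \pi, -\pi^* \rangle = -U(\pi)$.

□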
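
The characterization of subgradients can be checked numerically in a toy
setting. The sketch below assumes that $Y$ is the convex hull of finitely many
non-negative points, so that the infimum of a linear function over $Y$ is
attained at one of the generating points; the helper names are hypothetical.

```python
def dot(x, y):
    """Scalar product of two vectors given as sequences of floats."""
    return sum(a * b for a, b in zip(x, y))


def u_value(x, generators):
    """U(x) = -inf_{y in Y} <x, y>, with Y the convex hull of `generators`."""
    return -min(dot(x, y) for y in generators)


def subgradient_at(pi, generators):
    """Return pi* = -y*, where y* minimizes <pi, y> over the generators."""
    y_star = min(generators, key=lambda y: dot(pi, y))
    return tuple(-v for v in y_star)


# Toy data (assumed for illustration): three generators in [0, inf)^2.
GENERATORS = [(1.0, 0.0), (0.0, 1.0), (0.3, 0.3)]
PI = (0.5, 0.5)
```

Here the minimizer is $(0.3, 0.3)$, so `subgradient_at(PI, GENERATORS)` returns
$(-0.3, -0.3)$, and both conditions of the lemma can be verified directly.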

Proof of Lemma [15]. By Assumption [2] there exists a finite point $g_{\mathrm{fin}} \in \Sigma_\Lambda^\eta \cap
[0,\infty)^\Omega$, where $E_\pi g_{\mathrm{fin}}$ is finite for any $\pi$. By Assumption [1], $\Sigma_\Lambda^\eta$ is compact, and
therefore the minimum is attained for all $\pi \in \mathbb{R}^\Omega$. Thus $H$ is well defined. Note
also that $H(\pi) \ge 0$ for $\pi \in \mathcal{P}(\Omega)$ and $H(\pi) = -\infty$ if $\pi(\omega) < 0$ for some $\omega \in \Omega$.
Now let us show that

\[ H(\pi) = \inf_{g \in \Sigma_\Lambda^\eta \cap [0,\infty)^\Omega} E_\pi g . \]

Again by Assumption [2] the infimum is taken over a non-empty set. If $\pi(\omega) < 0$
for some $\omega \in \Omega$ then $H(\pi) = -\infty$ and the infimum is equal to $-\infty$ as well.
Thus we need to consider only the case when $\pi(\omega) \ge 0$ for all $\omega \in \Omega$ and the
minimum in the definition of $H$ is attained at a point $g$ such that $g(\omega) = \infty$ for
some values of $\omega$. Note that for these $\omega$ we have $\pi(\omega) = 0$, since $H(\pi) < \infty$.
Choose a sequence $g_n \in \Sigma_\Lambda^\eta \cap [0,\infty)^\Omega$ that converges to $g$ (for example, consider
the segment between the points $e^{-\eta g}$ and $e^{-\eta g_{\mathrm{fin}}}$, and take a sequence $e^{-\eta g_n}$
along this segment). Since $g_n(\omega)$ and $g(\omega)$ are finite for non-zero $\pi(\omega)$, we get
$E_\pi g_n \to E_\pi g = H(\pi)$, and thus the infimum is not greater than $H(\pi)$.

Now we can apply Lemma [32] with $Y = \Sigma_\Lambda^\eta \cap [0,\infty)^\Omega$ and $U(\pi) = -H(\pi)$.
It implies that for any $\pi \in \mathcal{P}(\Omega)$, the set of subgradients of $U$ at $\pi$ is the
set of points where the infimum of $E_\pi g$ over $g \in \Sigma_\Lambda^\eta \cap [0,\infty)^\Omega$ is attained. If
$\pi \in \mathcal{P}^\circ(\Omega)$, the infimum is attained indeed, and it is unique by Lemma [14].
By Theorem 25.1 in [T5], the function $H$ is differentiable at $\pi$, and the point
$\lambda(\pi,\cdot) = \arg\min_{g\in\Sigma_\Lambda^\eta} E_\pi g$ is the gradient of $H$. Thus $W \supseteq \mathcal{P}^\circ(\Omega)$.

On the other hand, if $H$ is differentiable, the set of subgradients consists of
one element only, the gradient. Theorem 25.5 in [TS] implies that the gradient
mapping $\pi \mapsto \lambda(\pi,\cdot)$ is continuous on $W$. □

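Proof of Lemma \TSl. Due to Assumption [1], $\mathcal{M}\Sigma_\Lambda^\eta$ is compact and therefore
contains all its limit points, that is, $\gamma \in \mathcal{M}\Sigma_\Lambda^\eta$.

Let $E^+_{\pi'} g$ be a shorthand for $\sum_{\omega\in\Omega\colon \pi(\omega) \ne 0} \pi'(\omega) g(\omega)$ for any $g \in [0,\infty]^\Omega$ and
$\pi' \in \mathcal{P}(\Omega)$. By definition, $E_\pi \gamma = E^+_\pi \gamma$.

Note first that $E_{\pi_i} g$ converges to $E_\pi g$ for any finite $g \in [0,\infty)^\Omega$.
Note also that $E^+_{\pi_i} \gamma_i$ converges to $E^+_\pi \gamma$. Indeed, $E^+_{\pi_i} \gamma_i \le E_{\pi_i} \gamma_i \le E_{\pi_i} g_{\mathrm{fin}} \le
\sum_{\omega\in\Omega} g_{\mathrm{fin}}(\omega) < \infty$, where $g_{\mathrm{fin}} \in \Sigma_\Lambda^\eta \cap [0,\infty)^\Omega$ exists by Assumption [2]. If
$\pi(\omega) \ne 0$ then $\pi_i(\omega)$ is separated from $0$ for sufficiently large $i$, therefore $\gamma_i(\omega)$
are bounded, and their limit $\gamma(\omega)$ is finite. And for finite limits $\gamma$ and $\pi$, the
convergence is trivial.

Fix any $g_0 \in \Sigma_\Lambda^\eta \cap [0,\infty)^\Omega$ and any $\epsilon > 0$. For sufficiently large $i$, we have
$E^+_{\pi_i} \gamma_i \ge E^+_\pi \gamma - \epsilon$ and $E_{\pi_i} g_0 \le E_\pi g_0 + \epsilon$. Taking into account that $E^+_{\pi_i} \gamma_i \le E_{\pi_i} \gamma_i$
and $E_{\pi_i} \gamma_i = \min_{g\in\Sigma_\Lambda^\eta} E_{\pi_i} g \le E_{\pi_i} g_0$, we get $E_\pi \gamma \le E_\pi g_0 + 2\epsilon$. Since $\epsilon$ and $g_0$
are arbitrary, we have

\[ E_\pi \gamma \le \inf_{g \in \Sigma_\Lambda^\eta \cap [0,\infty)^\Omega} E_\pi g , \]

and the last infimum can be replaced by $\min_{g\in\Sigma_\Lambda^\eta}$ as shown in the proof of
Lemma [15]. □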

Proof of Lemma [27]. We construct a continuous mapping $F\colon \Sigma_\Lambda^\eta \to \mathcal{M}\Sigma_\Lambda^\eta$ as a
composition of mappings $F_\omega$ for all $\omega \in \Omega$. Each $F_\omega$ when applied to $g \in \Sigma_\Lambda^\eta$
preserves the values of $g(o)$ for $o \ne \omega$ and decreases as far as possible the value
$g(\omega)$ so that the result is still in $\Sigma_\Lambda^\eta$. Formally, $F_\omega(g) = g'$ such that $g'(o) = g(o)$
for $o \ne \omega$ and $g'(\omega) = \min\{ \gamma(\omega) \mid \gamma \in \Sigma_\Lambda^\eta,\ \forall o \ne \omega\ \gamma(o) = g(o) \}$.

Let us show that each $F_\omega$ is continuous. It suffices to show that $F_\omega(g)(\omega)$
depends continuously on $g$, since the other coordinates do not change. We will
show that $F_\omega(g)(\omega)$ is convex in $g$; continuity follows (see, e.g. [H]). Indeed,
take any $t \in [0,1]$ and $g_1, g_2 \in \Sigma_\Lambda^\eta$. Since $\Sigma_\Lambda^\eta$ is convex, $t g_1 + (1-t) g_2 \in \Sigma_\Lambda^\eta$
and $t F_\omega(g_1) + (1-t) F_\omega(g_2) \in \Sigma_\Lambda^\eta$. The latter point has all the coordinates $o \ne \omega$
the same as the former. Thus, by definition of $F_\omega$, we get $F_\omega(t g_1 + (1-t) g_2)(\omega) \le
(t F_\omega(g_1) + (1-t) F_\omega(g_2))(\omega) = t F_\omega(g_1)(\omega) + (1-t) F_\omega(g_2)(\omega)$, which was to be
shown.

All $F_\omega$ do not increase the coordinates. Since the set $\Sigma_\Lambda^\eta$ contains any point $g$
with all its majorants, $F_{\omega_1}(g_1) = g_1$ implies that $F_{\omega_1}(g_2) = g_2$ for any $g_2$ obtained
from $g_1$ by applying any $F_{\omega'}$. Therefore, the image of a composition of $F_\omega$ over
all $\omega \in \Omega$ is included in $\mathcal{M}\Sigma_\Lambda^\eta$. □
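
The coordinate-wise construction can be illustrated in one concrete game. The
sketch below assumes the logarithmic loss game with $\eta = 1$, where the
superprediction set is $\{ g \mid \sum_\omega e^{-g(\omega)} \le 1 \}$ and its minimal points satisfy
this constraint with equality; in that case each $F_\omega$ has a closed form. The
function names are hypothetical and the code is not the general construction.

```python
import math


def lower_coordinate(g, i):
    """F_omega for the log-loss game: decrease g[i] as far as possible while
    keeping sum(exp(-g[j])) <= 1; the other coordinates are unchanged."""
    rest = sum(math.exp(-v) for j, v in enumerate(g) if j != i)
    out = list(g)
    # If the other coordinates already exhaust the budget, g[i] must be infinite.
    out[i] = math.inf if rest >= 1.0 else -math.log(1.0 - rest)
    return out


def retract_to_minimal(g):
    """Compose F_omega over all coordinates, as in the proof above."""
    for i in range(len(g)):
        g = lower_coordinate(g, i)
    return g
```

In this game the first step already lands on the set of minimal points and the
remaining steps leave the point unchanged (up to rounding), which matches the
retraction property used in the main text.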
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